Doomimir: This is all very interesting, but I don’t think it bears much on the reasons we’re all going to die. It’s all still on the “is” side of the is–ought gap. What makes intelligence useful—and dangerous—isn’t a fixed repertoire of behaviors. It’s search, optimization—the systematic discovery of new behaviors to achieve goals despite a changing environment. I don’t think recent capabilities advances bear on the shape of the alignment challenge because being able to learn complex behavior on the training distribution was never what the problem was about.
It’s not really search per se that’s dangerous. It’s the world model you use for the search. If that model is rich enough to support the search, yet gets feedback too poor to correct its errors, then when you search over it you get unacceptable side-effects. The trick for safe AI is to have a model with enough structure that algorithmic searches over it can solve important problems, while also having that structure be human-interpretable enough that we can correctly specify the goals we want to achieve, rather than rolling the dice on unknown side-effects.
When you set out poisoned ant baits, you likely don’t think of yourself as trying to deceive the ants, but you are. Similarly, a smart AI won’t think of itself as trying to deceive us. It’s trying to achieve its goals. If its plans happen to involve emitting sound waves or character sequences that we interpret as claims about the world, that’s our problem.
In writing this, I’m kind of giving a spoiler for a WIP post on how to solve alignment, but I’ve been procrastinating on it so much that I might as well:
When you set out poisoned ant baits, you do think of yourself as trying to kill the ants. This is plausibly the primary effect of putting out the ant bait! Other plausibly big effects would be “supporting the companies that make ant baits”, “killing other critters”, “causing various kinds of pollution”, and “making it visible to other people that you put out ant baits”.
But if you were trying not to have any effect on the ants, it would be convergent for you to avoid deceiving them. In fact, the ant poisons I saw in my childhood tended to have warnings on them, specifically so that humans wouldn’t accidentally consume them and be harmed. (Though looking it up now, it seems milder ant baits are used these days, which don’t need warnings? Maybe due to environmentalism? I don’t know.)
The big question is whether any self-supervised models will expose enough structure that you can rely on this sort of reasoning when building your capabilities. I think alignment research should bet “yes”, at least to the extent of wanting to develop such models to the point where they are useful.
Doomimir: [starting to anger] Simplicia Optimistovna, if you weren’t from Earth, I’d say I don’t think you’re trying to understand. I never claimed that GPT-4 in particular is what you would call deceptively aligned. Endpoints are easier to predict than intermediate trajectories. I’m talking about what will happen inside almost any sufficiently powerful AGI, by virtue of it being sufficiently powerful.
Let’s say you want to build a fusion power plant.
A sufficiently powerful way to do this would be to take over the world and make the entire world optimize for building a fusion power plant.
However, “building a fusion power plant” would not be the primary effect of taking over the world; instead, some sort of dictatorial scheme, or perhaps hypnodrones or whatever, would be the primary effect. The fusion power plant would be a secondary effect.
“Do whatever it takes to achieve X” is evil and sufficiently noncomposable that it is not instrumentally convergent, so it seems plausible that it won’t be favored by capabilities researchers. Admittedly, current reinforcement learning research does seem to operate under the “do whatever it takes to achieve X” paradigm, but alignment research focused on making X more palatable, rather than on foundations for doing something more minimal, seems misguided. Counterproductive, even, since making X sufficiently good doesn’t seem feasible, yet working on it makes it more tempting to just do whatever it takes anyway.
Doomimir: [cooler] Basically, I think you’re systematically failing to appreciate how things that have been optimized to look good to you can predictably behave differently in domains where they haven’t been optimized to look good to you—particularly, when they’re doing any serious optimization of their own. You mention the video game agent that navigates to the right instead of collecting a coin. You claim that it’s not surprising given the training set-up, and can be fixed by appropriately diversifying the training data. But could you have called the specific failure in advance, rather than in retrospect? When you enter the regime of transformatively powerful systems, you do have to call it in advance.
Doomimir: For now. But any system that does powerful cognitive work will do so via retargetable general-purpose search algorithms, which, by virtue of their retargetability, need to have something more like a “goal slot”. Your gradient updates point in the direction of more consequentialism.
I don’t think this is true, because whenever the AIs solve a goal with a bunch of unintended side-effects, that is going to rank low on the preferences, so the gradient updates would much more consistently point in the direction of bounded consequentialism rather than unbounded consequentialism.
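To make the ranking claim concrete, here is a toy sketch. All the plan names, scores, and the penalty weight are invented for illustration; this is not a model of any real training setup, just the shape of the argument:

```python
# Toy illustration of "plans with big side-effects rank low on the preferences".
# Every name and number here is made up purely for illustration.

SIDE_EFFECT_PENALTY = 2.0  # raters strongly disprefer unintended impact

# Hypothetical candidate plans for "get a fusion power plant built",
# scored on task achievement and on how much else they change in the world.
plans = {
    "contract engineers and build the plant": {"task": 0.8, "side_effects": 0.1},
    "cut corners on containment":             {"task": 0.9, "side_effects": 0.5},
    "take over the world, then build it":     {"task": 1.0, "side_effects": 1.0},
}

def preference_score(outcome):
    """Score an outcome the way a side-effect-averse rater would."""
    return outcome["task"] - SIDE_EFFECT_PENALTY * outcome["side_effects"]

for name, outcome in sorted(plans.items(),
                            key=lambda kv: preference_score(kv[1]),
                            reverse=True):
    print(f"{preference_score(outcome):+.2f}  {name}")

# The maximally "powerful" plan gets the lowest score, so feedback derived
# from these scores pushes toward bounded rather than unbounded consequentialism.
```

(Of course, this only helps insofar as the side-effects actually show up in whatever the scores are computed from.)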
Human raters pressing the thumbs-up button on actions that look good to them are going to make mistakes. Your gradient updates point in the direction of “playing the training game”—modeling the training process that actually provides reinforcement, rather than internalizing the utility function that Earthlings naïvely hoped the training process would point to. I’m very, very confident that any AI produced via anything remotely like the current paradigm is not going to end up wanting what we want, even if it’s harder to say exactly when it will go off the rails or what it will want instead.
But I’m not sure how to reconcile that with the empirical evidence that deep networks are robust to massive label noise: you can train on MNIST digits with twenty wrong labels for every correct one and still get good performance as long as the correct label is slightly more common than the most common wrong label. If I extrapolate that to the frontier AIs of tomorrow, why doesn’t that predict that biased human reward ratings should result in a small performance reduction, rather than … death?
The noise in the MNIST case is random. Random noise is the easiest kind of noise to overcome, so it seems silly to update too hard on such an experiment, when the errors in human reward ratings are biased rather than random.
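For concreteness, here is roughly the kind of setup that label-noise result refers to, as a minimal sketch. This is my own toy reconstruction, not the paper’s protocol; the architecture, noise model, and hyperparameters are placeholders:

```python
# Minimal sketch: train an MNIST classifier where most labels are randomized.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

NOISE_RATIO = 20  # about twenty randomized labels for every label kept clean

train = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())

# Corrupt labels: keep the true label with probability 1/(NOISE_RATIO + 1),
# otherwise draw a label uniformly at random over all ten classes. Because the
# random draws are uniform, the true label is still the single most common
# label for each image class, which is what the robustness result relies on.
keep = torch.rand(len(train.targets)) < 1.0 / (NOISE_RATIO + 1)
random_labels = torch.randint(0, 10, (len(train.targets),))
train.targets = torch.where(keep, train.targets, random_labels)

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

loader = DataLoader(train, batch_size=128, shuffle=True)
for epoch in range(5):
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

# Evaluated on the clean test set, accuracy stays high, because the uniform
# noise averages out while the correct-label signal does not.
```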
One thing I should maybe emphasize, which my comment above maybe doesn’t make clear enough, is that “GPTs do imitation learning, which is safe” and “we should do bounded optimization rather than unbounded optimization” are two independent, mostly unrelated points. More on the latter point is coming in a post I’m writing, whereas more on the former point is available in links like this.
It is late at night, I can’t think clearly, and I may later disavow whatever I say right now. But the comment you link to is incredible and contains content that zogs rather than zigs or zags from my perspective, and I’m going to revisit it when I can think good.
I also want to flag that I have been enjoying your comments when I see them on this site, and find them novel, inquisitive and well-written. Thank you.
The issue is that when GPTs fail to generalize, they lose their capabilities, not just their alignment, because their capabilities originate from mimicking humans.
Reward is not the optimization target.