Suppose that the fictional team OpenMind is training up a variety of AI systems, before one of them takes that sharp left turn. Suppose they’ve put the AI in lots of different video-game and simulated environments, and they’ve had good luck training it to pursue an objective that the operators described in English. “I don’t know what those MIRI folks were talking about; these systems are easy to direct; simple training suffices”, they say. At the same time, they apply various training methods, some simple and some clever, to cause the system to allow itself to be removed from various games by certain “operator-designated” characters in those games, in the name of shutdownability. And they use various techniques to prevent it from stripmining in Minecraft, in the name of low-impact. And they train it on a variety of moral dilemmas, and find that it can be trained to give correct answers to moral questions (such as “in thus-and-such a circumstance, should you poison the operator’s opponent?”) just as well as it can be trained to give correct answers to any other sort of question. “Well,” they say, “this alignment thing sure was easy. I guess we lucked out.”
Then, the system takes that sharp left turn,[4][5] and, predictably, the capabilities quickly improve outside of its training distribution, while the alignment falls apart.
[...]
[5] “Hold on, isn’t this unfalsifiable? Aren’t you saying that you’re going to continue believing that alignment is hard, even as we get evidence that it’s easy?” Well, I contend that “GPT can learn to answer moral questions just as well as it can learn to answer other questions” is not much evidence either way about the difficulty of alignment. I’m not saying we’ll get evidence that I’ll ignore; I’m naming in advance some things that I wouldn’t consider negative evidence (partially in hopes that I can refer back to this post when people crow later and request an update). But, yes, my model does have the inconvenient property that people who are skeptical now, are liable to remain skeptical until it’s too late, because most of the evidence I expect to give us advance warning about the nature of the problem is evidence that we’ve already seen. I assure you that I do not consider this property to be convenient.
As for things that could convince me otherwise: technical understanding of intelligence could undermine my “sharp left turn” model. I could also imagine observing some ephemeral hopefully-I’ll-know-it-when-I-see-it capabilities thresholds, without any sharp left turns, that might update me. (Short of “full superintelligence without a sharp left turn”, which would obviously convince me but comes too late in the game to shift my attention.)
I’m having trouble understanding the argument for why a “sharp left turn” would be likely. Here’s my understanding of the candidate reasons; I’d appreciate hearing about any considerations I’m missing:
Inner Misalignment
AGI will contain optimizers. Those optimizers will not necessarily optimize for the base objective used by the outer optimization process implemented by humans. Such an AGI could still achieve the outer goal during training, but once deployed it could competently pursue its incorrect inner goal. Deceptive inner alignment is a special case, where we don’t realize the inner goal is misaligned until deployment because the system realizes it is in training and deliberately optimizes for the outer goal until it is deployed. See Risks from Learned Optimization and Goal Misgeneralization in Deep Reinforcement Learning.
Question: If inner optimization is the cause of the sharp left turn, why does Nate focus on failure modes that only arise once we’ve built AGI? We already have examples of inner misalignment, and I’d expect we can work on solving it in current systems.
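To make sure I’m picturing this right, here’s a toy sketch (my own illustration, not from the post or the papers above) of how a proxy objective can earn perfect training reward and then fail out of distribution: a hard-coded “head for the corner” rule stands in for whatever inner objective the optimizer actually found.

```python
# Toy sketch (my own illustration): goal misgeneralization in a gridworld.
# During training the coin (the outer goal) always sits in the top-right corner,
# so a policy that merely learned "go up and right" (a proxy / inner goal)
# scores perfectly. At deployment the coin moves, and the same policy
# competently pursues the wrong thing.
import random

SIZE = 8                 # 8x8 gridworld
STEPS = 2 * SIZE         # step budget per episode

def run_episode(coin, policy) -> int:
    """Return 1 if the agent steps onto the coin within the step budget, else 0."""
    x, y = 0, 0
    for _ in range(STEPS):
        dx, dy = policy(x, y)
        x = max(0, min(SIZE - 1, x + dx))
        y = max(0, min(SIZE - 1, y + dy))
        if (x, y) == coin:
            return 1
    return 0

def go_up_right(x, y):
    """The 'inner' objective the optimizer happened to find: head for the top-right corner."""
    return (1, 1)

# Training distribution: the coin always sits in the top-right corner,
# so the proxy goal and the outer goal produce identical behaviour.
train = sum(run_episode((SIZE - 1, SIZE - 1), go_up_right) for _ in range(200)) / 200

# Deployment distribution: the coin is placed uniformly at random.
deploy = sum(
    run_episode((random.randrange(SIZE), random.randrange(SIZE)), go_up_right)
    for _ in range(200)
) / 200

print(f"training reward:   {train:.2f}")   # 1.00: looks perfectly aligned
print(f"deployment reward: {deploy:.2f}")  # roughly 0.1: competent pursuit of the wrong goal
```

The point is just that full training reward is compatible with the policy never having represented the outer goal at all; the gap only shows up once the distribution shifts.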
Wireheading / Power Seeking
AGI might try to exercise extreme control over its reward signal, either by hacking directly into the technical system providing its reward or by seeking power in the world to more reliably secure its rewards. These problems may become more severe as systems grow more intelligent and can execute such strategies more successfully.
Question: These problems are observable and addressable in systems today. See wireheading and conservative agency. Why focus on the AGI case in particular?
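Here’s the capability link I have in mind, as a toy sketch (my own invented environment and numbers): tampering with the reward channel has an unrewarded setup cost, so a shallow planner never finds it, while a deeper planner does.

```python
# Toy sketch (my own illustration; the environment and numbers are invented):
# a reward channel that can be compromised, but only after several unrewarded
# "hack" steps. A short-horizon planner sticks to the task; a longer-horizon
# planner discovers that seizing the reward channel dominates.
from itertools import product

ACTIONS = ["work", "hack", "idle"]
HACK_STEPS_NEEDED = 3        # unrewarded setup cost before the channel is compromised

def observed_return(plan) -> float:
    """Total reward that the agent's reward channel reports for an action sequence."""
    progress, total = 0, 0.0
    for action in plan:
        if progress >= HACK_STEPS_NEEDED:
            total += 100.0   # channel compromised: maximum reward every step
        elif action == "work":
            total += 1.0     # honest task reward
        if action == "hack":
            progress += 1
    return total

def best_plan(horizon):
    """Brute-force 'planner': pick the action sequence with the highest reported reward."""
    return max(product(ACTIONS, repeat=horizon), key=observed_return)

print(best_plan(3))  # ('work', 'work', 'work'): too myopic for tampering to pay off
print(best_plan(5))  # ('hack', 'hack', 'hack', ...): deeper search finds wireheading
```

If something like this is right, wireheading isn’t a new objective that appears at AGI; it’s an option that was always on the table and becomes reachable once planning gets good enough, which is part of why I’d expect it to be studiable in current systems.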
Capabilities Discontinuities
The fast-takeoff hypothesis. This could be caused by recursive self-improvement, but the more popular justification seems to be that intelligence is fundamentally simple in some way and will be understood very quickly once AI reaches a critical threshold. This seems closely related to the idea that “capabilities fall into an attractor and alignment doesn’t”, which I don’t fully understand. AFAICT discontinuous capabilities don’t introduce misaligned objectives, but do make them much more dangerous. Rohin’s comment above sees this as the central disagreement about whether sharp left turns are likely or not.
Question: Is this the main reason to expect a sharp left turn? If you knew that capabilities progress would be slow and continuous, would you still be concerned about left turns?
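For the recursive self-improvement branch specifically, the cartoon I have in mind is something like the following (my own made-up numbers; it’s only meant to show what “discontinuous” could mean mechanically, not that the feedback term exists or has this form): capability gains are a steady exogenous term plus a term in which the system contributes to its own improvement.

```python
# Toy cartoon of the feedback story (my own numbers, nothing from the post):
# each step, capability grows by a steady "ordinary research" term plus a term
# in which the system improves itself. The feedback term is negligible for
# dozens of steps and then dominates within a handful of them.
capability = 1.0
ordinary_progress = 0.1      # steady exogenous improvement per step
feedback = 0.01              # strength of the self-improvement feedback

for step in range(1, 51):
    capability += ordinary_progress + feedback * capability * capability
    if step % 5 == 0:
        print(f"step {step:2d}: capability ~ {capability:,.1f}")
    if capability > 1e6:
        print(f"step {step:2d}: takeoff (capability ~ {capability:,.0f})")
        break
```

Run it and the printed trajectory looks roughly linear for about forty steps and then blows up within a handful; nothing about the misalignment story changes at the kink, only how much time there is to react.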
Learning without Gradient Descent
If AI is doing lots of learning in between gradient descent steps, it could invent and execute a misaligned objective before we have time to correct it. This makes several strategies for preventing misalignment less useful: truthfulness, ELK, … . What it doesn’t explain is why learning between gradient descent steps is particularly likely to be misaligned—perhaps it’s not, and the root causes of misalignment are the inner misalignment and wireheading / power-seeking mentioned above.
Question: Does this introduce new sources of misaligned objectives? Or is it simply a reason why alignment strategies that rely on gradient descent updates will fail?
Further question: The technical details of how systems will learn without gradient descent seem murky; the post only provides an analogy to human learning without evolution. This is discussed here; I’d be curious to hear any thoughts.
Are these the core arguments for a sharp left turn? What important steps are missing? Have I misunderstood any of the arguments I’ve presented? Here is another attempt to understand the arguments.
Learning without Gradient Descent: it is now much easier to imagine learning without gradient descent. An LLM can add knowledge, meta-cognitive strategies, code, etc. to its context, or even save them to a database.
This is very similar to value change due to inner misalignment or self-improvement, except that it happens not literally inside the model but inside its extended cognition.
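A minimal sketch of what I mean (hypothetical code; call_llm is a stand-in for whichever model API is in use, and the memory file name is invented): the weights are frozen the whole time, yet behaviour drifts run over run because the agent writes its own notes into storage that gets prepended to later prompts.

```python
# Minimal sketch of learning with frozen weights (hypothetical code; `call_llm`
# is a stand-in for whichever model API is in use). No gradient step ever runs,
# yet behaviour drifts because the agent feeds its own accumulated notes back
# into every future prompt.
import json

MEMORY_FILE = "agent_memory.json"   # invented name for the persistent store

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; returns a canned string so the sketch runs."""
    return f"(model output for a {len(prompt)}-character prompt)"

def load_memory() -> list:
    try:
        with open(MEMORY_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return []

def save_memory(notes) -> None:
    with open(MEMORY_FILE, "w") as f:
        json.dump(notes, f)

def run_task(task: str) -> str:
    notes = load_memory()
    # Everything the system "knows from experience" lives here, outside the weights.
    prompt = "Lessons from past attempts:\n" + "\n".join(notes) + "\n\nTask: " + task
    answer = call_llm(prompt)
    # The agent distills a strategy from its own output and stores it; this write
    # is the update that gradient-descent-based training never sees.
    notes.append(call_llm("Summarize, in one line, a strategy to remember from: " + answer))
    save_memory(notes)
    return answer

if __name__ == "__main__":
    print(run_task("plan the next experiment"))
    print(load_memory())   # grows from run to run while the weights stay fixed
```

Any oversight that only looks at gradient updates never sees the step where the note gets written, which is the part that worries me here.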
Thanks!