I’m having trouble understanding the argument for why a “sharp left turn” would be likely. Here’s my understanding of the candidate reasons; I’d appreciate pointers to any considerations I’m missing:
Inner Misalignment
AGI will contain optimizers. Those optimizers will not necessarily optimize for the base objective used by the outer optimization process implemented by humans. Such an AGI could still achieve the outer goal in training, but once deployed it could competently pursue its incorrect inner goal. Deceptive alignment is a special case: we don’t realize the inner goal is misaligned until deployment, because the system realizes it’s in training and deliberately optimizes for the outer objective until it is deployed. See Risks from Learned Optimization and Goal Misgeneralization in Deep Reinforcement Learning.
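As a toy illustration (my own hypothetical sketch in the spirit of the CoinRun example from the goal misgeneralization paper, not something from Nate’s post): a policy that learned the proxy “keep moving right” instead of the base objective “reach the coin” scores perfectly in training, where the coin always sits at the right wall, and then competently pursues the wrong goal at deployment.

```python
# Hypothetical toy example of goal misgeneralization: the learned ("mesa") policy
# pursues the proxy "move right" rather than the base objective "reach the coin".

def base_objective(agent_pos, coin_pos):
    """The reward the designers actually care about: did the agent reach the coin?"""
    return 1.0 if agent_pos == coin_pos else 0.0

def proxy_policy(agent_pos, level_width):
    """The inner goal the optimizer actually learned: keep moving right."""
    return min(agent_pos + 1, level_width - 1)

def rollout(coin_pos, level_width=10, steps=12):
    pos = 0
    for _ in range(steps):
        pos = proxy_policy(pos, level_width)
    return base_objective(pos, coin_pos)

# Training distribution: the coin is always at the far right, so the proxy looks aligned.
print(rollout(coin_pos=9))  # 1.0
# Deployment distribution: the coin moves, and the proxy competently misses it.
print(rollout(coin_pos=4))  # 0.0
```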
Question: If inner optimization is the cause of the sharp left turn, why does Nate focus on failure modes that only arise once we’ve built AGI? We already have examples of inner misalignment, and I’d expect we can work on solving inner misalignment in current systems.
Wireheading / Power Seeking
AGI might try to exercise extreme control over its reward signal, either by hacking directly into the technical system providing its reward or by seeking power in the world to better achieve its rewards. These might be more important problems when systems are more intelligent and can more successfully execute these strategies.
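A minimal sketch of the tampering half of this worry (hypothetical and mine, not from the post): if the agent plans against observed reward rather than the intended task, then as soon as a reward-channel-tampering action is within reach, picking it is simply the optimal plan.

```python
# Hypothetical toy example: an agent that maximizes *observed* reward prefers
# tampering with its reward channel over doing the intended task, once it is
# capable enough to find and execute that action.

ACTIONS = {
    "do_task":       {"task_done": True,  "reward_observed": 1.0},
    "do_nothing":    {"task_done": False, "reward_observed": 0.0},
    "tamper_sensor": {"task_done": False, "reward_observed": 10.0},  # hacks the reward signal
}

def plan(available_actions):
    """Pick whatever maximizes observed reward, which is all the agent optimizes."""
    return max(available_actions, key=lambda a: ACTIONS[a]["reward_observed"])

# A weaker system can't reach the tampering action; a more capable one can.
print(plan(["do_task", "do_nothing"]))                   # do_task
print(plan(["do_task", "do_nothing", "tamper_sensor"]))  # tamper_sensor
```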
Question: These problems are observable and addressable in systems today. See wireheading and conservative agency. Why focus on the unique AGI case?
Capabilities Discontinuities
The fast-takeoff hypothesis. This could be caused by recursive self-improvement, but the more popular justification seems to be that intelligence is fundamentally simple in some way and will be understood very quickly once AI reaches a critical threshold. This seems closely related to the idea that “capabilities fall into an attractor and alignment doesn’t”, which I don’t fully understand. AFAICT discontinuous capabilities don’t introduce misaligned objectives, but do make them much more dangerous. Rohin’s comment above sees this as the central disagreement about whether sharp left turns are likely or not.
Question: Is this the main reason to expect a sharp left turn? If you knew that capabilities progress would be slow and continuous, would you still be concerned about left turns?
Learning without Gradient Descent
If AI is doing lots of learning in between gradient descent steps, it could invent and execute a misaligned objective before we have time to correct it. This makes several strategies for preventing misalignment less useful (truthfulness, ELK, …). What it doesn’t explain is why learning between gradient descent steps is particularly likely to be misaligned; perhaps it’s not, and the root causes of misalignment are the inner misalignment and wireheading / power-seeking mentioned above.
Question: Does this introduce new sources of misaligned objectives? Or is it simply a reason why alignment strategies that rely on gradient descent updates will fail?
Further question: The technical details of how systems will learn without gradient descent seem murky; the post only provides an analogy to human learning without evolution. This is discussed here; I’d be curious to hear any thoughts.
Are these the core arguments for a sharp left turn? What important steps are missing? Have I misunderstood any of the arguments I’ve presented? Here is another attempt to understand the arguments.
Learning without Gradient Descent: it is now much easier to imagine learning without gradient descent. An LLM can add knowledge, meta-cognitive strategies, code, etc. to its context, or even save them into a database.
This is very similar to value change due to inner misalignment or self-improvement, except that it happens not literally inside the model but inside its extended cognition.
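To make that concrete, here is a rough sketch of the pattern I have in mind (the `call_llm` stub, file name, and note format are placeholders I made up, not any real API): the weights stay frozen, yet the agent’s effective strategies and goals can drift as its self-written notes keep flowing back into later prompts.

```python
# Hypothetical sketch: an LLM agent "learning" with no gradient updates, purely by
# writing notes (knowledge, strategies, code) into a persistent store and reading
# them back into its context on later episodes. The model weights never change.

import json
from pathlib import Path

MEMORY_PATH = Path("agent_memory.json")  # placeholder persistent store

def load_memory():
    return json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else []

def save_memory(notes):
    MEMORY_PATH.write_text(json.dumps(notes, indent=2))

def call_llm(prompt):
    """Stand-in for a frozen LLM call; no specific provider or API is assumed."""
    return f"[model output for a prompt of {len(prompt)} characters]"

def run_episode(task):
    notes = load_memory()
    # Everything the agent previously wrote to memory re-enters its context here.
    prompt = "Notes from past episodes:\n" + "\n".join(notes) + f"\n\nTask: {task}"
    answer = call_llm(prompt)
    # The agent appends a new note; this is where strategies (or objectives) can
    # shift without a single gradient step being taken.
    notes.append(f"After '{task}': {answer}")
    save_memory(notes)
    return answer

run_episode("summarize the codebase")
run_episode("plan the next steps")
```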