Here’s my best quickly-written story of why I expect AGI to understand human goals, but not share them. The intended audience is mostly myself, so I use personal jargon.
What a system at AGI level wants depends a lot on how it coheres its different goal-like instincts during self-reflection. Introspection & (less strongly) neuroscience tell us that humans start with very similar-to-each-other internal impulses and self-reflection processes, and also end up with similar goals. AI has a quite different set of genes/architecture and environment/training data, and so we can’t expect its internal impulses and metacognition to be as similar to ours. Instead it acquires an unknown, different set of internals that is easy to pick up with gradient descent and enables becoming a general intelligence.
All smart agents do eventually learn natural abstractions, e.g. algebra or what the humans living in the outside world want, and can use that knowledge, e.g. during a treacherous turn. But the internal impulses aren’t pushed towards a natural abstraction, as there’s no unique universal solution, though there are local attractors. Instead they depend on subtler details of the architecture & training data. Also, the difference between AI and human internals might not be visible early in its behavior, because
a) we do select the internals such that during training the AI isn’t visibly misaligned in its behavior (+whatever interpretability we have).
b) the behavioral differences we do spot may be explained both by different internal motivations and by a lack of capabilities.
c) even if we don’t train against visible misalignment, similar instrumental pressures may apply in humans and AI, leading to partially similar behavior.
Without a better theory of what distinguishes goal-like from capability-like learned cognitive and behavioral patterns, it’s not straightforward to formalize similarity between goal-like instincts at low capability levels, much as AIs that don’t behave in a robustly goal-directed way can’t be assigned a coherent goal and are perhaps better described with shard theory. Without a better theory of how AGIs will do self-reflection and metacognition, it’s not clear which sets of internal impulses during early training will later cohere into a safe goal. And it’s also not clear to me how to actually know a goal is safe.
In particular, I don’t think that using AI assistants to solve the alignment problem will work, as investigating metacognition probably requires AGI-level capabilities. Instead we just get a series of AI assistants that successfully train away visible misalignment in each other, using their knowledge of what the external humans want, until finally one AI or group of AIs realizes that a treacherous turn has become the better action.
Mayyybe there will be a stage during which the AI assistants recognize the flaw in this alignment solution, and have not yet cohered their impulses in a way that leads to misaligned behavior. In that case, the AI assistants may warn us, giving us a very late, and easily ignored, fire alarm.
Arguably, though, we do have the beginnings of theories of metacognition; see e.g. “A Theoretical Understanding of Self-Correction through In-context Alignment”.