I think the core confusion is that outer/inner (mis)-alignment have different (reasonable) meanings which are often mixed up:
Threat models: outer misalignment and inner misalignment.
Desiderata sufficient for a particular type of proposal for AI safety: For a given AI, solve outer alignment and inner alignment (sufficiently well for a particular AI and deployment) and then combine these solutions to avoid misalignment issues.
The key thing is that threat models are not necessarily problems that need to be solved directly. For instance, AI control aims to address the threat model of inner misalignment without solving inner alignment.
Definition as threat models
Here are these terms defined as threat models:
Outer misalignment is the threat model where catastrophic outcomes result from providing problematic reward signals to the AI. This includes cases where problematic feedback results in catastrophic generalization behavior as described in “Without specific countermeasures…”[1], but also slow-rolling catastrophe due to direct incentives from reward as discussed in “What failure looks like: You get what you measure”.
Inner misalignment is the threat model where catastrophic outcomes happen regardless of whether the reward process is problematic due to the AI pursuing undesirable objectives in at least some cases. This could be due to scheming (aka deceptive alignment), problematic goals which correlate with reward (aka proxy alignment with problematic generalization), or threat models more like deep deceptiveness.
This seems like a pretty reasonable decomposition of problems to me, but again note that these problems don’t have to respectively be solved by inner/outer alignment “solutions”.
Definition as a particular type of AI safety proposal
This proposal has two necessary desiderata:
Outer alignment: a reward provision process such that sufficiently good outcomes would occur if our actual AI maximized this reward to the best of its abilities[2]. Note that we only care about maximization given a specific AI’s actual capabilities and affordances, the process doesn’t need to be robust to arbitrary optimization. As such, outer alignment is with respect to a particular AI and how it is used[3]. (See also the notion of local adequacy we define here.)
Inner alignment: a process which sufficiently ensures that an AI actually does robustly “try”[4] to maximize a given reward provision process (from the prior step) including in novel circumstances[5].
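To make the distinction concrete, here is a toy sketch (my own hypothetical example with made-up action names and values, not from the original post). Outer misalignment: the reward process itself is flawed, so even an AI that perfectly maximizes reward picks a bad action. Inner misalignment: the reward is fine on the training distribution, but the AI "tries" to maximize an internal proxy that comes apart from reward in novel circumstances.

```python
# Toy illustration of the outer/inner decomposition. Each action is labeled
# with a true value, the reward the process assigns, and (for the inner case)
# the AI's internal proxy objective. All names and numbers are invented.

def best(actions, key):
    """Return the action that maximizes the given objective."""
    return max(actions, key=key)

# Outer misalignment: the reward process rewards the wrong thing, so the
# reward-maximizing action has negative true value.
outer_case = [
    {"name": "honest_report", "true": 1.0, "reward": 0.5},
    {"name": "game_the_metric", "true": -1.0, "reward": 0.9},  # flawed reward
]
assert best(outer_case, key=lambda a: a["reward"])["true"] < 0

# Inner misalignment: proxy and reward agree in training, but diverge in a
# novel deployment situation, so the proxy-maximizing action gets low reward.
inner_case_train = [
    {"name": "helpful", "reward": 1.0, "proxy": 1.0},  # proxy matches reward
]
inner_case_deploy = [
    {"name": "helpful", "reward": 1.0, "proxy": 0.2},
    {"name": "seize_resources", "reward": -1.0, "proxy": 0.9},  # divergence
]
assert best(inner_case_deploy, key=lambda a: a["proxy"])["reward"] < 0
```

The point of the toy is just that the two failure modes are separable: the outer case fails even with a perfect reward-maximizer, while the inner case fails even with a locally adequate reward process.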
This seems like a generally reasonable overall proposal, though there are alternatives. And the caveats around outer alignment only needing to be locally adequate are important.
Doc with more detail
This content is copied out of this draft which I’ve never gotten around to cleaning up and publishing.
The outer misalignment threat model covers cases where problematic feedback results in training a misaligned AI, even when the oversight process used for training would have caught the catastrophically bad behavior if it had been applied to that action.
For AIs which aren’t well described as pursuing goals, it’s sufficient for the AI to just be reasonably well optimized to perform well according to this reward provision process. However, note that AIs which aren’t well described as pursuing goals also likely pose no misalignment risk.
Prior work hasn’t been clear that outer alignment solutions only need to be robust to the particular AI produced by some training process, but this seems extremely key to a reasonable definition of the problem from my perspective. This is both because we don’t need to be arbitrarily robust (the key AIs will only be so smart, perhaps not even smarter than humans) and because approaches might depend on utilizing the AI itself in the oversight process (recursive oversight), such that they’re only robust to that AI but not others. For an example of recursive oversight being robust to a specific AI, consider ELK-type approaches. ELK could be applicable to outer alignment via ensuring that a human overseer is well informed about everything a given AI knows (but not necessarily well informed about everything any AI could know).
Again, this is only important insofar as AIs are doing anything well described as “trying” in any cases.
In some circumstances, it’s unclear exactly what it would even mean to optimize a given reward provision process as the process is totally inapplicable to the novel circumstances. We’ll ignore this issue.
That’s an excellent point.
I agree. I think that’s probably a better way of clarifying the confusion than what I wrote.