Nice post! Two things I particularly like are the explicit iteration (demonstrating by example how and why not to only use one framing), as well as the online learning framing.
The policy behaves in a competent yet undesirable way which gets low reward according to the original reward function.[2] This is an inner alignment failure, also known as goal misgeneralization. Langosco et al. (2022) provide a more formal definition and some examples of goal misgeneralization.
It seems like a core part of this initial framing relies on the operationalisation of “competent”, yet you don’t really point to what you mean.
Notably, “competent” cannot mean “high-reward” (because of category 4) and “competent” cannot mean “desirable” (because of category 3 and 4).
Instead you point at something like “Whatever it’s incentivized to do, it’s reasonably good at accomplishing it”. I share a similar intuition, but just wanted to highlight that subtleties might hide there (maybe addressed in later framings, but at least not mentioned at this point).
In general, we should expect that alignment failures are more likely to be in the first category when the test environment is similar to (or the same as) the training environment, as in these examples; and more likely to be in the second category when the test environment is very different from the training environment.
What comes to my mind (and is mentioned a bit after the quote) is that we could think of different hypotheses about the hardness of alignment as quantifying how similar the test environment must be to the training one to avoid inner misalignment. Potentially, for harder versions of the problem, almost any difference that could tractably be detected is enough for the AI to behave differently.
I’d encourage alignment researchers to get comfortable switching between these different framings, since each helps guide our thinking in different ways. Framing 1 seems like the most useful for connecting to mainstream ML research. However, I think that focusing primarily on Framing 1 is likely to overemphasize failure modes that happen in existing systems, as opposed to more goal-directed future systems. So I tend to use Framing 2 as my main framing when thinking about alignment problems. Lastly, when it’s necessary to consider online training, I expect that the “goal robustness” version of Framing 3 will usually be easier to use than the “high-stakes/low-stakes” version, since the latter requires predicting how AI will affect the world more broadly. However, the high-stakes/low-stakes framing seems more useful when our evaluations of AGIs are intended not just for training them, but also for monitoring and verification (e.g. to shut down AGIs which misbehave).
Great conclusion! I particularly like your highlighting that each framing is better suited to different purposes.
It seems like a core part of this initial framing relies on the operationalisation of “competent”, yet you don’t really point to what you mean. Notably, “competent” cannot mean “high-reward” (because of category 4) and “competent” cannot mean “desirable” (because of category 3 and 4). Instead you point at something like “Whatever it’s incentivized to do, it’s reasonably good at accomplishing it”.
I think here, competent can probably be defined in one of two (perhaps equivalent) ways:
1. Restricted reward spaces/informative priors over reward functions: as the appropriate folk theorem goes, any policy is optimal according to some reward function. “Most” policies are incompetent; consequently, many reward functions incentivize behavior that seems incoherent/incompetent to us. It seems that when I refer to a particular agent’s behavior as “competent”, I’m often making reference to the fact that it achieves high reward according to a “reasonable” reward function that I can imagine. Otherwise, the behavior just looks incoherent. This is similar to the definition used in Langosco, Koch, Sharkey et al.’s goal misgeneralization paper, which depends on a non-trivial prior over reward functions.
2. Demonstrates instrumentally convergent/power-seeking behavior: in environments with regularities, certain behaviors are instrumentally convergent/power-seeking; that is, they’re likely to occur for a large class of reward functions. To evaluate whether behavior is competent, we can look for behavior that seems power-seeking to us (e.g., not dying in a game). Incompetent behavior is that which doesn’t exhibit power-seeking or instrumentally convergent drives.
The reason these two can be equivalent is the aforementioned folk theorem: since every policy has a reward function that rationalizes it, there exist priors over reward functions under which the implied prior over optimal policies doesn’t demonstrate power-seeking behavior.
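For concreteness, here’s a minimal sketch of the folk-theorem construction behind both points (a hypothetical toy MDP with illustrative names, not anything from the post): for any deterministic policy π, the indicator reward “1 for taking π’s action, 0 otherwise” makes π optimal, since following π collects the maximum per-step reward forever.

```python
import numpy as np

# Minimal sketch on a hypothetical toy MDP (all names are illustrative).
# Folk-theorem construction: for ANY deterministic policy pi, the reward
#   R_pi(s, a) = 1 if a == pi(s) else 0
# makes pi optimal, because following pi earns the maximum reward at every step.

n_states, n_actions, gamma = 5, 3, 0.9
rng = np.random.default_rng(0)

# Arbitrary deterministic dynamics: P[s, a] is the next state.
P = rng.integers(0, n_states, size=(n_states, n_actions))

# An arbitrary policy -- it may look completely incoherent/"incompetent".
pi = rng.integers(0, n_actions, size=n_states)

# The rationalizing reward function for pi.
R = np.zeros((n_states, n_actions))
R[np.arange(n_states), pi] = 1.0

# Value iteration under R_pi.
Q = np.zeros((n_states, n_actions))
for _ in range(500):
    Q = R + gamma * Q.max(axis=1)[P]

# The optimal policy under R_pi is exactly pi.
assert np.array_equal(Q.argmax(axis=1), pi)
```

The assert passing for an arbitrary pi is the point: without a restricted prior over reward functions, “optimal for some reward” places no constraint on how competent or power-seeking the behavior looks.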