Yudkowsky mentions this briefly in the middle of the dialogue:
I don’t know however if I should be explaining at this point why “manipulate humans” is convergent, why “conceal that you are manipulating humans” is convergent, why you have to train in safe regimes in order to get safety in dangerous regimes (because if you try to “train” at a sufficiently unsafe level, the output of the unaligned system deceives you into labeling it incorrectly and/or kills you before you can label the outputs), or why attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes (qualitatively new thought processes, things being way out of training distribution, and, the hardest part to explain, corrigibility being “anti-natural” in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior (“consistent utility function”) which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off).
Basically, there are reasons to expect that alignment techniques that work in smaller, safe regimes will fail in larger, unsafe regimes. For example, an alignment technique that requires your system to demonstrate undesirable behavior while running could remain safe while your system is weak, but become dangerous once the system is powerful enough for that undesirable behavior to cause real harm.
That being said, Ajeya’s “Case for Aligning Narrowly Superhuman models” does flesh out the case for trying to align existing systems (as capabilities scale).
If you know of a reference to, or feel like explaining in some detail, the arguments given (in parentheses) for this claim, I’d love to hear them!
I’m familiar with these claims, and (I believe) the principal supporting arguments that have been made publicly. I think I understand them reasonably well.
I don’t find them decisive. Some aren’t even particularly convincing. A few points:
- EY sets up a false dichotomy between “train in safe regimes” and “train in dangerous regimes”. In the approaches under discussion there is an ongoing effort (e.g. involving some form(s) of training) to align the system, and the proposal is to keep this effort ahead of advances in capability (in some sense).
- The first 2 claims for why corrigibility wouldn’t generalize seem to prove too much—why would intelligence generalize to “qualitatively new thought processes, things being way out of training distribution”, but corrigibility would not?
- I think the last claim—that corrigibility is “anti-natural”—is more compelling. However, we don’t actually understand the space of possible utility functions and agent designs well enough for it to be that strong. We know that any behavior is compatible with a utility function, so I would interpret Eliezer’s claim as relating to the complexity (description length) of utility functions that encode corrigible behavior. Work on incentives suggests that removing manipulation incentives might add very little complexity to the description of the utility function, for an AI system that already understands the world well. Humans also seem to find it simple enough to add the “without manipulation” qualifier to an objective. (A toy sketch of this point follows this list.)
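To illustrate the small added description length, here is a toy sketch of my own (not taken from the incentives literature; it assumes, purely for illustration, that the agent’s world model already exposes a predicate $\mathrm{manip}(h)$ saying whether history $h$ involved manipulating the overseer). Given a base utility function $U$ over histories, one way to write the “without manipulation” qualifier is

$$U'(h) \;=\; U(h) \;-\; \lambda \cdot \mathbb{1}[\mathrm{manip}(h)], \qquad \lambda \text{ large}.$$

The extra description length over $U$ is just the reference to $\mathrm{manip}$ plus a few symbols; the expensive part, representing what counts as manipulation, is assumed to already be paid for by the world model.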
why would intelligence generalize to “qualitatively new thought processes, things being way out of training distribution”, but corrigibility would not?
This sounds confused to me: the intelligence is the “qualitatively new thought processes”. The thought processes aren’t some new regime that intelligence has to generalize to. Also, to answer the question directly, I think the claim is that intelligence (which I’d say is synonymous for these purposes with capability) is simpler and more natural than corrigibility (i.e., the last claim—I don’t think these three claims are to be taken separately).
We know that any behavior is compatible with a utility function
People keep saying this but it seems false to me. I’ve seen the construction for history-based utility functions that’s supposed to show this, and don’t find it compelling—it seems not to be engaging with what EY is getting at with “coherent planning behavior”. Is there a construction for (environment)-state-based utility functions? I’m not saying that is exactly the right formalism to demonstrate the relationship between coherent behaviour and utility functions, but it seems to me closer to the spirit of what EY is getting at. (This comment thread on the topic seems pretty unresolved to me.)
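For concreteness, the construction I’ve seen goes roughly like this (notation is just for the sketch; take $\pi^*$ deterministic for simplicity). Let $\pi^*$ be an arbitrary policy, i.e. a function from action-observation history prefixes to actions, and define a utility function over complete histories $h = (a_1, o_1, \dots, a_T, o_T)$ by

$$u(h) \;=\; \begin{cases} 1 & \text{if } a_t = \pi^*(a_1, o_1, \dots, a_{t-1}, o_{t-1}) \text{ for every } t, \\ 0 & \text{otherwise.} \end{cases}$$

Every history $\pi^*$ generates gets $u(h) = 1$, the maximum possible value, so $\pi^*$ maximizes expected utility for $u$; in that sense any behavior is compatible with some history-based utility function. Note that this construction leans entirely on reading the agent’s own past actions off the history, which is part of why it doesn’t settle the question about (environment)-state-based utility functions.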