I’m familiar with these claims, and (I believe) the principal supporting arguments that have been made publicly. I think I understand them reasonably well.
I don’t find them decisive. Some aren’t even particularly convincing. A few points:
- EY sets up a false dichotomy between “train in safe regimes” and “train in dangerous regimes”. In the approaches under discussion there is an ongoing effort (e.g. involving some form(s) of training) to align the system, and the proposal is to keep this effort ahead of advances in capability (in some sense).
- The first two claims for why corrigibility wouldn’t generalize seem to prove too much: why would intelligence generalize to “qualitatively new thought processes, things being way out of training distribution”, but corrigibility would not?
- I think the last claim, that corrigibility is “anti-natural”, is more compelling. However, we don’t actually understand the space of possible utility functions and agent designs well enough for that argument to be very strong. We know that any behavior is compatible with some utility function, so I would interpret Eliezer’s claim as being about the complexity (description length) of utility functions that encode corrigible behavior. Work on incentives suggests that removing manipulation incentives might add very little to the description length of the utility function of an AI system that already understands the world well; humans also seem to find it simple enough to add a “without manipulation” qualifier to an objective.
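To make the description-length point a bit more concrete, here is my own rough gloss (not a construction anyone in the incentives literature is committed to). Suppose the system’s world model already represents a predicate $M(h)$ meaning “the human’s feedback was manipulated over history $h$”, and we start from some base utility function $U$. One naive way to encode the “without manipulation” qualifier is

$$
U_{\text{corr}}(h) \;=\; \begin{cases} U(h) & \text{if } \neg M(h), \\ c & \text{otherwise,} \end{cases}
$$

for some constant $c$ below the range of $U$. The description of $U_{\text{corr}}$ is just $U$, a pointer to the already-represented predicate $M$, and a little glue. If the system already understands the world well enough that $M$ is cheap to point at, the added description length is small, which is the sense in which the “without manipulation” qualifier seems cheap to me.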
> why would intelligence generalize to “qualitatively new thought processes, things being way out of training distribution”, but corrigibility would not?
This sounds confused to me: the intelligence is the “qualitatively new thought processes”; the thought processes aren’t some new regime that intelligence has to generalize to. Also, to answer the question directly: I think the claim is that intelligence (which for these purposes I’d say is synonymous with capability) is simpler and more natural than corrigibility, i.e., the last claim; I don’t think these three claims are meant to be taken separately.
> We know that any behavior is compatible with a utility function
People keep saying this, but it seems false to me. I’ve seen the construction for history-based utility functions that’s supposed to show this (roughly the one sketched below), and I don’t find it compelling: it doesn’t seem to engage with what EY is getting at with “coherent planning behavior”. Is there a construction for (environment-)state-based utility functions? I’m not saying that’s exactly the right formalism for demonstrating the relationship between coherent behaviour and utility functions, but it seems to me closer to the spirit of what EY is getting at. (This comment thread on the topic seems pretty unresolved to me.)
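For concreteness, the construction I have in mind is roughly the following (a minimal sketch, in my own notation). Fix any deterministic policy $\pi$ mapping histories to actions, and define a utility function over complete histories $h = (o_1, a_1, o_2, a_2, \ldots)$ by

$$
u_\pi(h) \;=\; \begin{cases} 1 & \text{if } a_t = \pi(o_1, a_1, \ldots, o_t) \text{ for every } t, \\ 0 & \text{otherwise.} \end{cases}
$$

Then $\pi$ achieves utility 1 with certainty, so it maximizes expected $u_\pi$, and in that sense any behavior is “compatible with” a history-based utility function. But $u_\pi$ just memorizes the policy; it says nothing about coherent planning, which is why I don’t find the construction compelling.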