Big agree.
Another way I might frame this is that corrigibility isn’t just about what actions we want the AI to choose, it’s about what policies we want the AI to choose.
For any policy, of course, you can always ask “What actions would this policy recommend in the real world? So, wouldn’t we be happy if the AI just picked those?” or “What utility functions over universe-histories would produce that best sequence of actions? So, wouldn’t one of those be good?”
And if you could compute those some way other than by thinking about what we want from the policy the AI chooses to implement, be my guest. But my point is that corrigibility is a grab-bag of different things people want from an AI, and some of them are pretty directly demands on the policy: they concern what the agent would do in multiple possible cases, not just what it will do in the one best case.
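To make the distinction concrete, here is a minimal toy sketch (all names and the shutdown example are hypothetical, not anyone's actual formalism): a policy is a function from possible situations to actions, and a corrigibility-flavored property quantifies over what the policy would do across many counterfactual situations, which you can't read off the single sequence of actions that actually happens.

```python
from typing import Callable

# A "situation" and an "action" are just labels here; a policy maps one to the other.
Situation = str
Action = str
Policy = Callable[[Situation], Action]

def example_policy(situation: Situation) -> Action:
    """A toy policy: pursue the goal unless the operator asks for shutdown."""
    if situation == "operator_requests_shutdown":
        return "shut_down"
    return "pursue_goal"

def is_shutdown_corrigible(policy: Policy, situations: list[Situation]) -> bool:
    """A policy-level property: in *every* situation where shutdown is requested,
    the policy complies. This quantifies over counterfactual situations, so it
    cannot be checked from the one action trace that actually occurs."""
    return all(
        policy(s) == "shut_down"
        for s in situations
        if s == "operator_requests_shutdown"
    )

# Action-level view: the one situation that actually happens, and the one action taken.
actual_action = example_policy("normal_operation")  # -> "pursue_goal"

# Policy-level view: what the agent *would* do across possible situations.
scenarios = ["normal_operation", "operator_requests_shutdown"]
print(is_shutdown_corrigible(example_policy, scenarios))  # -> True
```

The point of the sketch is only that the property being checked is a predicate on the whole function, not on the single action it happens to output in the real world.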