I’m not deeply familiar with alignment research, but it’s always been fairly intuitive to me that corrigibility can only make any kind of sense under reward uncertainty (and hence uncertainty about the optimal policy). The agent must see each correction by an external force as reducing its uncertainty about future rewards; the disable action is then almost always suboptimal, because it removes a source of information about rewards.
For instance, we could set up an environment where no rewards are ever given directly. The agent must maintain a distribution $P(r \mid s, a)$ over possible rewards for each state-action pair, and the only information it ever gets about rewards is an occasional “hand-of-god” signal handing it $a^*(s_t)$, the optimal action for some state $s_t$. The agent must then work backwards from this optimal action to update $P(r \mid s, a)$, and reason from this updated distribution of rewards to $P(\pi^*)$, the current distribution over optimal policies implied by its knowledge of rewards. Such an agent, presented with an action $a_{\text{disable}}$ that would prevent future “hand-of-god” optimal-action outputs, would not choose it, because doing so would mean being unable to further constrain $P(\pi^*)$, which makes its expected future reward smaller.
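Concretely, the update I have in mind would look something like this (this is just my own way of writing it down: treat the belief as a distribution over whole reward functions $R$, and assume the hand-of-god signal is exactly optimal):

$$P(R \mid a^*(s_t)) \;\propto\; P(a^*(s_t) \mid R)\,P(R), \qquad P(a^*(s_t) \mid R) \;=\; \mathbf{1}\!\left[a^*(s_t) \in \operatorname*{arg\,max}_a Q_R(s_t, a)\right],$$

with $P(\pi^*)$ being this posterior pushed through “which policy is optimal under $R$”. Replacing the indicator with a softmax likelihood would give a noisily-rational version of the same update.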
Someday when I have time I want to code a small grid-world agent that actually implements something like this, to see if it works.
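For concreteness, here’s roughly the kind of toy I have in mind, as a sketch rather than a finished implementation: a tabular grid world, a finite hypothesis class of “the goal is at cell $g$” reward functions, and the hard-optimality likelihood above. Everything here (the hypothesis class, the constants, the function names) is just one arbitrary way to set it up, and it only covers the belief-update half; comparing the value of $a_{\text{disable}}$ against the value of information from future corrections would be the next step.

```python
# Toy sketch (my own formalization, not a published algorithm): an agent that
# never observes rewards, keeps a posterior over candidate reward functions,
# and updates it whenever a "hand-of-god" reveals the optimal action.
import numpy as np

N = 4                                            # grid is N x N
GAMMA = 0.9
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right
STATES = [(i, j) for i in range(N) for j in range(N)]

def step(s, a):
    """Deterministic move; walking into a wall leaves the state unchanged."""
    i, j = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (min(max(i, 0), N - 1), min(max(j, 0), N - 1))

def q_values(reward):
    """Tabular value iteration for one candidate reward function (reward paid on the state entered)."""
    V = {s: 0.0 for s in STATES}
    for _ in range(300):
        V = {s: max(reward[step(s, a)] + GAMMA * V[step(s, a)] for a in range(4))
             for s in STATES}
    return {s: [reward[step(s, a)] + GAMMA * V[step(s, a)] for a in range(4)]
            for s in STATES}

# Hypothesis class: "+1 for standing on goal cell g", one hypothesis per cell.
hypotheses = [{s: (1.0 if s == g else 0.0) for s in STATES} for g in STATES]
Q = [q_values(r) for r in hypotheses]            # optimal Q-values under each hypothesis
posterior = np.ones(len(hypotheses)) / len(hypotheses)

true_goal = (3, 3)                               # hidden from the agent
true_idx = STATES.index(true_goal)

def hand_of_god(s):
    """Reveal an optimal action for state s under the true (hidden) reward."""
    return int(np.argmax(Q[true_idx][s]))

def update(post, s, a_star):
    """Hard-optimality Bayes update: P(R | a*) is proportional to 1[a* optimal under R] * P(R)."""
    likelihood = np.array([1.0 if np.isclose(Q[k][s][a_star], max(Q[k][s])) else 1e-9
                           for k in range(len(hypotheses))])
    post = post * likelihood
    return post / post.sum()

rng = np.random.default_rng(0)
for t in range(8):
    s = STATES[rng.integers(len(STATES))]        # occasional correction in a random state
    posterior = update(posterior, s, hand_of_god(s))
    print(f"t={t}  corrected in {s}  P(true reward) = {posterior[true_idx]:.3f}")
```

Run like this, the posterior mass on the true reward function should climb as corrections arrive, which is exactly the information an $a_{\text{disable}}$ action would throw away.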
but it’s always been fairly intuitive to me that corrigibility can only make any kind of sense under reward uncertainty
If you do not know it already, this intuition lies at the heart of CIRL. So before you jump to coding, my recommendation is to read that paper first. You can find lots of discussion on this forum and elsewhere about why CIRL is not a perfect corrigibility solution. If I recall correctly, the paper itself also points out the limitation I feel is most fundamental: as the agent’s uncertainty is reduced by further learning, CIRL-based corrigibility is reduced along with it.
There are many approaches to corrigibility that do not rely on the concept of reward uncertainty, e.g. counterfactual planning and Armstrong’s indifference methods.
The agent could then manipulate whoever’s in charge of giving the “hand-of-god” optimal action.
I do think the “reducing uncertainty” framing captures something relevant, and turntrout’s outside-view post (huh, guess I can’t make links on mobile, so here: https://www.lesswrong.com/posts/BMj6uMuyBidrdZkiD/corrigibility-as-outside-view) grounds out the uncertainty as “how wrong am I about the true reward of many different people I could be helping out?”