I’m not deeply familiar with alignment research, but it’s always been fairly intuitive to me that corrigibility can only make any kind of sense under reward uncertainty (and hence uncertainty about the optimal policy). The agent must see each correction by an external force as reducing its uncertainty about future rewards; the disable action is then almost always suboptimal, because it removes a source of information about rewards.
For instance, we could set up an environment where no rewards are ever given directly. The agent must maintain a distribution $P(r \mid s, a)$ over possible rewards for each state-action pair, and the only information it ever gets about rewards is an occasional “hand-of-god” signal handing it $a^*(s_t)$, the optimal action for some state $s_t$. The agent must then work backwards from this optimal action to update $P(r \mid s, a)$, and reason from this updated distribution of rewards to $P(\pi^*)$, the current distribution over optimal policies implied by its knowledge of rewards. Such an agent, presented with an action $a_{\text{disable}}$ that would prevent future “hand-of-god” optimal-action outputs, would not choose it, because doing so would mean being unable to further constrain $P(\pi^*)$, which makes its expected future reward smaller.
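Concretely, the update I have in mind would look something like this (this is just my own way of writing it down: treat the belief as a distribution over whole reward functions $R$, and assume the hand-of-god signal is exactly optimal):

$$P(R \mid a^*(s_t)) \;\propto\; P(a^*(s_t) \mid R)\,P(R), \qquad P(a^*(s_t) \mid R) \;=\; \mathbf{1}\!\left[a^*(s_t) \in \operatorname*{arg\,max}_a Q_R(s_t, a)\right],$$

with $P(\pi^*)$ being this posterior pushed through “which policy is optimal under $R$”. Replacing the indicator with a softmax likelihood would give a noisily-rational version of the same update.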
Someday when I have time I want to code a small grid-world agent that actually implements something like this, to see if it works.
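For concreteness, here’s roughly the kind of toy I have in mind, as a sketch rather than a finished implementation: a tabular grid world, a finite hypothesis class of “the goal is at cell $g$” reward functions, and the hard-optimality likelihood above. Everything here (the hypothesis class, the constants, the function names) is just one arbitrary way to set it up, and it only covers the belief-update half; comparing the value of $a_{\text{disable}}$ against the value of information from future corrections would be the next step.

```python
# Toy sketch (my own formalization, not a published algorithm): an agent that
# never observes rewards, keeps a posterior over candidate reward functions,
# and updates it whenever a "hand-of-god" reveals the optimal action.
import numpy as np

N = 4                                            # grid is N x N
GAMMA = 0.9
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right
STATES = [(i, j) for i in range(N) for j in range(N)]

def step(s, a):
    """Deterministic move; walking into a wall leaves the state unchanged."""
    i, j = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (min(max(i, 0), N - 1), min(max(j, 0), N - 1))

def q_values(reward):
    """Tabular value iteration for one candidate reward function (reward paid on the state entered)."""
    V = {s: 0.0 for s in STATES}
    for _ in range(300):
        V = {s: max(reward[step(s, a)] + GAMMA * V[step(s, a)] for a in range(4))
             for s in STATES}
    return {s: [reward[step(s, a)] + GAMMA * V[step(s, a)] for a in range(4)]
            for s in STATES}

# Hypothesis class: "+1 for standing on goal cell g", one hypothesis per cell.
hypotheses = [{s: (1.0 if s == g else 0.0) for s in STATES} for g in STATES]
Q = [q_values(r) for r in hypotheses]            # optimal Q-values under each hypothesis
posterior = np.ones(len(hypotheses)) / len(hypotheses)

true_goal = (3, 3)                               # hidden from the agent
true_idx = STATES.index(true_goal)

def hand_of_god(s):
    """Reveal an optimal action for state s under the true (hidden) reward."""
    return int(np.argmax(Q[true_idx][s]))

def update(post, s, a_star):
    """Hard-optimality Bayes update: P(R | a*) is proportional to 1[a* optimal under R] * P(R)."""
    likelihood = np.array([1.0 if np.isclose(Q[k][s][a_star], max(Q[k][s])) else 1e-9
                           for k in range(len(hypotheses))])
    post = post * likelihood
    return post / post.sum()

rng = np.random.default_rng(0)
for t in range(8):
    s = STATES[rng.integers(len(STATES))]        # occasional correction in a random state
    posterior = update(posterior, s, hand_of_god(s))
    print(f"t={t}  corrected in {s}  P(true reward) = {posterior[true_idx]:.3f}")
```

Run like this, the posterior mass on the true reward function should climb as corrections arrive, which is exactly the information an $a_{\text{disable}}$ action would throw away.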
but it’s always been fairly intuitive to me that corrigibility can only make any kind of sense under reward uncertainty
If you do not know it already, this intuition lies at the heart of CIRL. So before you jump to coding, my recommendation is to read that paper first. You can find lots of discussion on this forum and elsewhere about why CIRL is not a perfect corrigibility solution. If I recall correctly, the paper itself also points out the limitation I feel is most fundamental: as the agent’s uncertainty is reduced by further learning, CIRL-based corrigibility is reduced along with it.
There are many approaches to corrigibility that do not rely on the concept of reward uncertainty, e.g. counterfactual planning and Armstrong’s indifference methods.
The agent could then manipulate whoever’s in charge of giving the “hand-of-god” optimal action.
I do think the “reducing uncertainty” framing captures something relevant, and turntrout’s outside-view post (huh, guess I can’t make links on mobile, so here: https://www.lesswrong.com/posts/BMj6uMuyBidrdZkiD/corrigibility-as-outside-view) grounds out the uncertainty as “how wrong am I about the true reward of many different people I could be helping out?”