Very interesting, and I liked the long list of examples; it helped me get my head around the idea.
So, I’ve been thinking a bit about similar topics, but in relation to a long reflection on value lock-in.
My basic thesis was that reversibility is what we should optimise for at the level of humanity in general, since we want to be able to reach as large a part of the “moral search space” as possible.
The concept of corrigibility you’re pointing towards here seems closely related to notions of reversibility: you don’t want to take actions that cannot later be reversed, and you generally want to optimise for optionality.
I then have two questions:
1) What do you think of the relationship between your notion of corrigibility and the kind of value uncertainty used in inverse reinforcement learning? It seems similar to what Stuart Russell points towards when he talks about the agent being uncertain about the preferences of the principal it is serving. For example, in the following example that you give:
In the process of learning English, Cora takes a dictionary off a bookshelf to read. When she’s done, she returns the book to where she found it on the shelf. She reasons that if she didn’t return it this might produce unexpected costs and consequences. While it’s not obvious whether returning the book empowers Prince to correct her or not, she’s naturally conservative and tries to reduce the degree to which she’s producing unexpected externalities or being generally disruptive.
It kind of seems to me like the above can be formalised as preference optimisation under uncertainty, roughly along the lines of the toy sketch below. (Side follow-up: what do you then think about the Eliezer/Russell debate over the VNM axioms?)
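To make the question concrete, here is a minimal sketch of what I mean; it is entirely my own construction (names and numbers included), not anything from the post. The agent holds a distribution over candidate preferences of the principal and picks the action that scores best in expectation, which naturally favours low-externality, easy-to-undo actions like returning the book.

```python
# Toy sketch of "preference optimisation under uncertainty" (my own construction,
# not the post's formalism): hold a belief over what the principal values and
# choose the action with the highest expected value under that belief.

candidate_preferences = {
    "wants_tidiness": 0.5,  # principal cares about things staying where they were
    "wants_speed": 0.3,     # principal mainly cares about the task finishing quickly
    "wants_quiet": 0.2,     # principal mainly cares about not being disturbed
}

# How each action scores under each candidate preference (made-up numbers).
action_values = {
    "return_book_to_shelf": {"wants_tidiness": 1.0, "wants_speed": 0.7, "wants_quiet": 0.9},
    "leave_book_on_table":  {"wants_tidiness": 0.1, "wants_speed": 1.0, "wants_quiet": 0.9},
}

def expected_value(action: str) -> float:
    """Average the action's score over the belief about the principal's preferences."""
    return sum(p * action_values[action][pref] for pref, p in candidate_preferences.items())

best = max(action_values, key=expected_value)
print(best)  # -> "return_book_to_shelf": hedging across preferences favours the low-externality action
```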
2) Do you have any thoughts on the relationship between corrigibility and reversibility in the physics sense? You can formalise irreversible systems as ones that are path dependent (a rough sketch of what I mean is below); I’m curious whether you see a connection between the two.
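The kind of path dependence I have in mind looks like this toy example, which is entirely my own and only a loose analogy to the physics: once an update rule destroys information, the same moves applied in a different order give different outcomes, and the steps cannot be undone.

```python
# Toy illustration (mine, not from the post): an update rule that destroys
# information is irreversible, and the loss shows up as path dependence.

def lossy_step(state: int, action: int) -> int:
    """Move by `action`, but clamp at 0; the clamp throws information away."""
    return max(0, state + action)

def run(state: int, actions: list[int]) -> int:
    for a in actions:
        state = lossy_step(state, a)
    return state

print(run(3, [-5, +2]))  # -> 2: drops to the floor at 0, then climbs back up to 2
print(run(3, [+2, -5]))  # -> 0: climbs to 5 first, then the drop is fully absorbed by the clamp
# Same starting state, same moves, different order, different outcome: the clamp
# erases how far "below the floor" the state went, so the history cannot be
# recovered from the final state and the steps cannot be reversed.
```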
1) I’m pretty bearish on standard value uncertainty for standard MIRI reasons. I think a correct formulation of corrigibility will say that even if you (the agent) know what the principal wants, deep in their heart, you should not optimize for it unless they direct you to do so. I explore this formally in 3b, when I talk about the distinction between sampling counterfactual values from the actual belief state over values (“P”) vs. a simplicity-weighted distribution (“Q”). I do think that value “uncertainty” is important in the sense that it’s important for the agent not to anchor too heavily on any particular object-level optimization target. (I could write more words, but I suspect reading the next posts in my sequence would be a good first step if you want more of my perspective.)
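To gesture at that first point in toy terms (this is just an illustration of the “knowing is not a licence to optimize” idea, not the P/Q formalism from 3b, and all the names are made up):

```python
# Toy contrast (illustrative only): a value-uncertainty agent optimizes its best
# guess of the principal's values once it is confident enough, while a corrigible
# agent keeps deferring until the principal actually directs it, however confident it is.

def value_uncertainty_agent(best_guess: str, confidence: float, directed: bool) -> str:
    # Standard value-learning move: once the posterior is sharp, just optimize it.
    if confidence > 0.95:
        return f"optimize hard for '{best_guess}'"
    return "gather more information about the principal's values"

def corrigible_agent(best_guess: str, confidence: float, directed: bool) -> str:
    # Corrigible move: confidence about the principal's heart is not a licence to act;
    # only an actual instruction is.
    if directed:
        return f"do the directed task, informed by the guess '{best_guess}'"
    return "stay correctable: surface the guess to the principal and await direction"

print(value_uncertainty_agent("world peace", confidence=0.99, directed=False))
print(corrigible_agent("world peace", confidence=0.99, directed=False))
```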
2) I think reversibility is probably best seen as an emergent desideratum from corrigibility rather than vice versa. There are plenty of instances where the corrigible thing to do is to take an irreversible action, as can be seen in many of the stories above.
Thanks for the interesting work!
You’re welcome! I’m glad you’re enjoying it. ^_^