I’m interested in hearing about how your approach handles this environment, because I think I’m getting lost in informal assumptions and symbol-grounding issues when reading about your proposed method.
I read your post; here are my initial impressions on how it relates to the discussion here.
In your post, you aim to develop a crisp mathematical definition of (in)coherence, i.e. VNM-incoherence. I like that; it looks like a good way to move forward. Developing the math further has definitely been my own approach to de-confusing certain intuitive notions about what should or should not be possible with corrigibility.
However, my first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the term incoherence. In my view, the four axioms of VNM-rationality have only very weak descriptive and constraining power when it comes to defining rational behavior. I believe that Eliezer's notion of rationality, and therefore his notion of coherence above, goes far beyond what is implied by the axioms of VNM-rationality. My feeling is that Eliezer is using the term 'coherence constraints' in an intuition-pump way, where coherence implies, or almost always implies, that a coherent agent will develop the incentive to self-preserve.
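For reference, and in my own notation (so treat this as a paraphrase rather than a quote of any particular source), the four axioms only constrain a preference relation $\succeq$ over lotteries $L$, $M$, $N$ as follows:

Completeness: $L \succeq M$ or $M \succeq L$.
Transitivity: if $L \succeq M$ and $M \succeq N$, then $L \succeq N$.
Continuity: if $L \succeq M \succeq N$, then there is some $p \in [0,1]$ with $pL + (1-p)N \sim M$.
Independence: for every lottery $N$ and every $p \in (0,1]$, $L \succeq M$ if and only if $pL + (1-p)N \succeq pM + (1-p)N$.

Nothing in these axioms mentions self-preservation, goal content, or even time; they only demand a certain internal consistency among preferences over lotteries, which is why I say their constraining power is weak.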
Looking at your post, I am also having trouble telling exactly how you are defining VNM-incoherence. You seem to be toying with several alternative definitions: one where it applies to reward functions (or preferences over lotteries) that are only allowed to examine the final state of a 10-step trajectory, and another where the reward function can examine the entire trajectory, and maybe also the actions taken to produce that trajectory. I think that your proof only works in the first case, but fails in the second. This has certain (fairly trivial) corollaries about building corrigibility. I'll expand on this in a comment I plan to attach to your post.
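In symbols (my notation, not necessarily yours), the two readings differ in what the reward function is allowed to see:

$R_{\text{final}} : S \to \mathbb{R}$, applied as $R_{\text{final}}(s_{10})$ to the final state only,
$R_{\text{traj}} : (S \times A)^{10} \times S \to \mathbb{R}$, applied as $R_{\text{traj}}(s_0, a_0, s_1, \ldots, a_9, s_{10})$ to the whole trajectory.

My claim above is that your proof goes through for the $R_{\text{final}}$ reading but not for the $R_{\text{traj}}$ reading.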
I’m interested in hearing about how your approach handles this environment,
I think one way to connect your ABC toy environment to my approach is to look at sections 3 and 4 of my earlier paper, where I develop a somewhat similar clarifying toy environment, with running code.
Another comment I can make is that your ABC nodes-and-arrows state transition diagram is a depiction which makes it hard to see how to apply my approach, because the depiction mashes up the state of the world outside of the compute core and the state of the world inside the compute core. If you want to apply counterfactual planning, or if you want to have an agent design that can compute the balancing function terms according to Armstrong's indifference approach, you need a different depiction of your setup. You need one which separates out these two state components more explicitly. For example, make an MDP model where the individual states are instances of the tuple (physical position of the agent in the ABC playing field, policy function loaded into the compute core).
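To sketch what I mean, here is a minimal, hypothetical encoding of such a factored state space in Python; the position names, policy names and transition rules are all made up by me and are not taken from your post or from my paper:

```python
from itertools import product

# Hypothetical state components: positions in the ABC playing field, and
# the policy function that is currently loaded into the compute core.
positions = ["A", "B", "C"]
loaded_policies = ["original_policy", "successor_policy"]

# Each MDP state is a tuple:
#   (position of the agent in the world outside the compute core,
#    policy function inside the compute core)
states = list(product(positions, loaded_policies))

def transition(state, action):
    """Toy deterministic transition function.

    'go_A', 'go_B', 'go_C' change only the outside-world component;
    'rewrite_core' changes only the inside-the-core component.
    """
    position, policy = state
    if action in ("go_A", "go_B", "go_C"):
        return (action[-1], policy)
    if action == "rewrite_core":
        return (position, "successor_policy")
    return state

print(states)
print(transition(("A", "original_policy"), "rewrite_core"))
```

The only point of this factored representation is that it makes the outside-world component and the inside-the-compute-core component separately visible, so that a counterfactual planner or a balancing term can refer to one of them without the other.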
Not sure how to interpret your statement that you got lost in symbol-grounding issues. If you can expand on this, I might be able to help.
Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here.
When it comes to Dutch booking as a coherence criterion, I need to repeat the observation I made below:
In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don’t see that fact mentioned often on this forum, so I will expand.
An agent that plans coherently given a reward function Rp to maximize paperclips will be an incoherent planner if you judge its actions by a reward function Rs that values the maximization of staples instead.
To extend this to Dutch booking: if you train a superintelligent poker-playing agent with a reward function that rewards it for losing at poker, you will find that it can be Dutch booked rather easily, if your Dutch booking test is whether you can find a counter-strategy that makes it lose money.
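To make the bookkeeping behind this observation concrete, here is a toy calculation in Python; the trajectory, the numbers and the two reward functions Rp and Rs are invented purely for illustration:

```python
# One fixed trajectory, produced by an agent that coherently maximizes
# the paperclip reward Rp.  We score the same trajectory under the
# staple reward Rs: judged by Rs, the agent systematically leaves value
# on the table, which is exactly the property a Dutch-book-style test
# (finding a counter-strategy that extracts money/utility) picks up on.
trajectory = [
    {"paperclips_made": 3, "staples_made": 0},
    {"paperclips_made": 5, "staples_made": 0},
]

def Rp(step):
    """Reward function that values paperclips."""
    return step["paperclips_made"]

def Rs(step):
    """Reward function that values staples instead."""
    return step["staples_made"]

return_under_Rp = sum(Rp(step) for step in trajectory)  # 8: looks optimal
return_under_Rs = sum(Rs(step) for step in trajectory)  # 0: looks wasteful

print(return_under_Rp, return_under_Rs)
```

Whether the agent comes out as 'coherent' or as easily Dutch-bookable in this toy calculation depends entirely on which of the two reward functions the judge uses, which is the point I am trying to make with the poker example above.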