Koen.Holtman comments on Ngo and Yudkowsky on alignment difficulty

Koen.Holtman 24 Nov 2021 10:32 UTC
2 points
AF
Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here.

When it comes to Dutch booking as a coherence criterion, I need to repeat the observation I made below:

In general, when you want to think about coherence without getting deeply confused, you need to keep track of which reward function you are using to judge whether your coherence criterion is met. I don’t see that fact mentioned often on this forum, so I will expand.

An agent that plans coherently given a reward function $R_{p}$ to maximize paperclips will be an incoherent planner if you judge its actions by a reward function $R_{s}$ that values the maximization of staples instead.
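To make this concrete, here is a minimal toy sketch. Everything in it is an illustrative assumption of mine: the three-action world, the names `ACTIONS`, `r_paperclips`, `r_staples`, `plan`, and `regret`, and the choice of regret as a stand-in for "incoherence". It only shows that the same plan scores as optimal under $R_{p}$ and as maximally bad under $R_{s}$.

```python
# Hypothetical toy world: each action yields (paperclips, staples).
ACTIONS = {
    "build_clip_factory": (10, 0),
    "build_staple_factory": (0, 10),
    "idle": (0, 0),
}

def r_paperclips(outcome):
    """Stand-in for R_p: counts paperclips only."""
    return outcome[0]

def r_staples(outcome):
    """Stand-in for R_s: counts staples only."""
    return outcome[1]

def plan(reward):
    """Pick the action that maximizes the given reward function."""
    return max(ACTIONS, key=lambda a: reward(ACTIONS[a]))

def regret(action, reward):
    """How far the action falls short of the best action under `reward`."""
    best = max(reward(outcome) for outcome in ACTIONS.values())
    return best - reward(ACTIONS[action])

choice = plan(r_paperclips)
print(choice)                        # the paperclip maximizer's plan
print(regret(choice, r_paperclips))  # regret judged by R_p
print(regret(choice, r_staples))     # regret judged by R_s
```

The agent's planning procedure is identical in both evaluations; only the reward function used by the judge changes.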

To extend this to Dutch booking: if you train a superintelligent poker-playing agent with a reward function that rewards it for losing at poker, you will find that it can be Dutch booked rather easily, if your Dutch booking test is whether you can find a counter-strategy that makes it lose money.
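The same point can be sketched for a money-based Dutch book test. This is a deliberately simplified stand-in, not a poker model: the bookie just offers a sequence of trades that each change the agent's money by a fixed amount, and the names `reward_win`, `reward_lose`, `accepts`, and `run_bookie` are hypothetical.

```python
def reward_win(money_delta):
    """Reward function that values gaining money."""
    return money_delta

def reward_lose(money_delta):
    """Reward function that values losing money."""
    return -money_delta

def accepts(money_delta, reward):
    """The agent accepts a trade only if it beats declining (a delta of 0)."""
    return reward(money_delta) > reward(0)

def run_bookie(reward, offers):
    """Offer each trade in turn and return the agent's final money balance."""
    money = 0
    for money_delta in offers:
        if accepts(money_delta, reward):
            money += money_delta
    return money

# The bookie's counter-strategy: every offer costs the agent one unit.
offers = [-1, -1, -1]

loser_balance = run_bookie(reward_lose, offers)
winner_balance = run_bookie(reward_win, offers)
print(loser_balance, winner_balance)
print(reward_lose(loser_balance))  # the loser's outcome judged by its own reward
```

Judged by money, the agent trained to lose is trivially exploitable. Judged by its own reward function, it accepted exactly the trades it should have.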