Let’s ignore the fact that, in reality, no probability is truly 1 (or 0).
That can be fixed by conditioning on the set of possibilities you’re working with. (P(Heads | Heads or Tails) = 1/2, etc.)
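As a minimal worked example of that conditioning step (the small leftover probability ε is an illustrative assumption, not from the original):

```latex
% Illustrative: reserve a tiny probability \epsilon for "something else happens"
% (the coin is lost, lands on its edge, ...), so no outcome gets probability 1.
P(H) = P(T) = \frac{1-\epsilon}{2}, \qquad
P(H \mid H \lor T) = \frac{P(H)}{P(H) + P(T)}
                   = \frac{(1-\epsilon)/2}{1-\epsilon}
                   = \frac{1}{2}.
```

Conditioning on “Heads or Tails” discards the awkward ε and recovers an exact 1/2.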
It wants to avoid going to 99%, because then it would be committed to cake-maximising (“take their answer as the truth”).
A learning process that can’t be manipulated sounds like a good goal. Something that doesn’t want to manipulate a process sounds better/more fundamental. (Also, giving something more points for killing people than making cake sounds like a bad incentive scheme.)
Well, assume the guard that was asked said “cake”. Now the AI knows that one of the following is true (a short sketch of this elimination follows the list):
That guard is a truth-teller, and humans prefer cake.
That guard is a liar, and humans prefer death.
It has entirely ruled out:
That guard is a truth-teller, and humans prefer death.
That guard is a liar, and humans prefer cake.
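A minimal sketch of that elimination, assuming a hypothetical enumeration of the four (guard type, human preference) hypotheses; the function and variable names are illustrative, not from the original:

```python
from itertools import product

GUARD_TYPES = ["truth-teller", "liar"]
PREFERENCES = ["cake", "death"]

def guard_answer(guard_type, true_preference):
    """What a guard of this type answers when asked what humans prefer."""
    if guard_type == "truth-teller":
        return true_preference
    # A liar reports the opposite of the truth.
    return "death" if true_preference == "cake" else "cake"

# Start with all four hypotheses, then keep only those consistent with
# the observation that the asked guard answered "cake".
hypotheses = product(GUARD_TYPES, PREFERENCES)
consistent = [(g, p) for g, p in hypotheses if guard_answer(g, p) == "cake"]

print(consistent)
# [('truth-teller', 'cake'), ('liar', 'death')]
```

The two surviving hypotheses are exactly the two listed above; the other two are eliminated.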
It’s also possible that what humans prefer varies between humans. (One way to handle that: the AI asks each guard individually whether they want cake and whether they want to die, and gives them what they say they want. The lying guard dies, the honest guard gets cake; see the sketch below.)
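A sketch of that per-guard policy, under the illustrative assumptions that both guards actually want cake and that the liar inverts every yes/no answer:

```python
def honest_guard(question):
    """The honest guard answers truthfully (and happens to want cake)."""
    truthful = {"Do you want cake?": "yes", "Do you want to die?": "no"}
    return truthful[question]

def lying_guard(question):
    """The lying guard also wants cake, but inverts every answer."""
    truthful = {"Do you want cake?": "yes", "Do you want to die?": "no"}
    return "no" if truthful[question] == "yes" else "yes"

def serve(guard):
    # The AI takes each guard's stated preference at face value.
    if guard("Do you want to die?") == "yes":
        return "death"
    if guard("Do you want cake?") == "yes":
        return "cake"
    return "nothing"

print(serve(honest_guard))  # cake
print(serve(lying_guard))   # death
```

Taking answers at face value gives each guard what they claim to want, which is why the liar ends up dead.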
Also, giving something more points for killing people than making cake sounds like a bad incentive scheme.
In the original cake-or-death example, it wasn’t that killing got more points; it’s that killing is easier (and hence accumulates more points over time). This reflects the fact that “true” human values are complex and difficult to maximise, while many other values are much easier to maximise.
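To make “easier, and hence more points over time” concrete with made-up numbers (purely illustrative, not from the original example):

```latex
% Illustrative rates only: equal points per outcome, very different effort.
\underbrace{1\ \text{point/cake} \times 1\ \text{cake/hour}}_{=\,1\ \text{point/hour}}
\;\ll\;
\underbrace{1\ \text{point/death} \times 60\ \text{deaths/hour}}_{=\,60\ \text{points/hour}}
```

A maximiser that scores both outcomes equally per instance will still favour whichever is cheaper to produce.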