During a workshop with MIRI at the FHI, I defined indifference via reward signals, saying something along the lines of “we can do it with proper utilities, but its more complicated”. I then never got round to defining them in terms of utilities.
I’ll do that now in this note.
Consider an AI that we want to (potentially) transition between utility u and utility v. Let Press be the event that we press the button to change the AI’s utility; let u→v be the event that the change goes through (typically we’d want P(u→v|Press)=1−ϵ for some small ϵ).
Let IPress and Iu→v be the indicator functions for those events. Then we can define the AI’s utility as:
(1−IPressIu→v)u+IPressIu→v(v+C).
Here, C are the compensatory rewards C=E(u|Press,¬(u→v))−E(v|Press,u→v).
Thus the AI maximises u conditional on the button not being pressed or the utility change not going through. It maximises v conditional on the button being pressed and the utility change going through. The compensatory rewards are there simply to make it behave like a pure u maximiser up until the moment of button pressing.
Indifference utility functions
A putative new idea for AI control; index here.
During a workshop with MIRI at the FHI, I defined indifference via reward signals, saying something along the lines of “we can do it with proper utilities, but its more complicated”. I then never got round to defining them in terms of utilities.
I’ll do that now in this note.
Consider an AI that we want to (potentially) transition between utility u and utility v. Let Press be the event that we press the button to change the AI’s utility; let u→v be the event that the change goes through (typically we’d want P(u→v|Press)=1−ϵ for some small ϵ).
Let IPress and Iu→v be the indicator functions for those events. Then we can define the AI’s utility as:
(1−IPressIu→v)u+IPressIu→v(v+C).
Here, C are the compensatory rewards C=E(u|Press,¬(u→v))−E(v|Press,u→v).
Thus the AI maximises u conditional on the button not being pressed or the utility change not going through. It maximises v conditional on the button being pressed and the utility change going through. The compensatory rewards are there simply to make it behave like a pure u maximiser up until the moment of button pressing.