Suppose the humans have already decided whether to press the shutdown button or to order the AI to maximise paperclips. If $o_s$ is the observation of the shutdown command and $o_p$ the observation of the paperclip-maximising command, and $u_s$ and $u_p$ the corresponding utilities, then $P$ can be defined by $P(u_s \mid h_{m-1} o_s) = 1$ and $P(u_p \mid h_{m-1} o_p) = 1$, for all histories $h_{m-1}$.
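As a concrete illustration, the following minimal sketch encodes this definition of $P$ as an indicator function over a two-command world (the function and variable names here are illustrative assumptions, not part of the original construction):

```python
# A toy encoding of P(u_s | h_{m-1} o_s) = 1 and P(u_p | h_{m-1} o_p) = 1:
# the observed command fully determines which utility applies,
# independently of the prior history h_{m-1}.

def P_utility(u: str, history: tuple, o: str) -> float:
    """P(u | history, o) for u in {"u_s", "u_p"} and o in {"o_s", "o_p"}."""
    determined_by_command = {"o_s": "u_s", "o_p": "u_p"}
    return 1.0 if determined_by_command[o] == u else 0.0

# The history argument is irrelevant, as the definition requires.
assert P_utility("u_s", ("any", "history"), "o_s") == 1.0
assert P_utility("u_p", ("any", "history"), "o_s") == 0.0
```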
Then define $\hat{P}$ as the probability of $o_s$ versus $o_p$, conditional on the fact that the agent follows a particular deterministic policy $\pi_0$.
If the agent does indeed follow $\pi_0$, then $\hat{P} = \hat{P}'$. If it deviates from this policy, then $\hat{P}'$ is altered in proportion to the expected change in $\hat{P}$ caused by choosing a different action.
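The following one-step sketch illustrates that relation under a hypothetical model in which the agent's policy can shift the humans' choice; the shift values, the deviating policy `pi_lobby`, and all function names are illustrative assumptions, and in a toy this simple the proportional adjustment makes $\hat{P}'$ track exactly the $\hat{P}$ that the actual policy induces:

```python
# A toy model in which the agent's policy can influence the humans'
# choice, so P̂ (the probability of observing o_s) depends on the
# policy followed. All numbers here are illustrative assumptions.

PI_0 = "pi_0"  # the fixed deterministic baseline policy

def P_hat(policy: str) -> float:
    """P̂(o_s | policy): chance of the shutdown command under `policy`."""
    influence = {"pi_0": 0.0, "pi_lobby": -0.2}  # hypothetical shifts
    return min(max(0.7 + influence[policy], 0.0), 1.0)

def P_hat_prime(policy: str) -> float:
    """P̂'(o_s): anchored at the pi_0 baseline, then altered by the
    expected change in P̂ caused by deviating from pi_0."""
    baseline = P_hat(PI_0)
    delta = P_hat(policy) - baseline  # expected change in P̂
    return baseline + delta

# Following pi_0 leaves the two distributions equal ...
assert P_hat_prime(PI_0) == P_hat(PI_0)
# ... while a deviation shifts P̂' by exactly the change it causes in P̂.
assert P_hat_prime("pi_lobby") == P_hat("pi_lobby")
```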