Hmm… I’m finding that I’m unable to write down a simple shutdown problem in this framework (e.g. an environment where it should switch between maximizing paperclips and shutting down) to analyze what this algorithm does. To know what the algorithm does, I need to know what $P$ and $\hat{P}$ are (since these are parameters of the algorithm). From those I can derive $P'$ and $\hat{P}'$ to determine the agent’s action. But at the moment I have no way of proceeding, since I don’t know what $P$ and $\hat{P}$ are. Can you get me unstuck?
Suppose the humans have already decided whether to press the shutdown button or order the AI to maximise paperclips. If $o_s$ is the observation of the shutdown command and $o_p$ the observation of the paperclip-maximising command, and $u_s$ and $u_p$ the relevant utilities, then $P$ can be defined by $P(u_s \mid h_{m-1} o_s) = 1$ and $P(u_p \mid h_{m-1} o_p) = 1$, for all histories $h_{m-1}$.
Then define $\hat{P}$ as the probability of $o_s$ versus $o_p$, conditional on the fact that the agent follows a particular deterministic policy $\pi_0$.
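For concreteness, one way to write this out (the parameter $q$ here is my own placeholder, not part of the original setup, standing for whatever credence the agent assigns to the humans having chosen shutdown):

$$\hat{P}(o_s \mid h_{m-1}, \pi_0) = q, \qquad \hat{P}(o_p \mid h_{m-1}, \pi_0) = 1 - q,$$

for some fixed $q \in [0,1]$ that does not depend on the agent’s subsequent actions, since the humans’ decision is taken to be already made.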
If the agent does indeed follow $\pi_0$, then $\hat{P} = \hat{P}'$. If it deviates from this policy, then $\hat{P}'$ is altered in proportion to the expected change in $\hat{P}$ caused by choosing a different action.
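To check my understanding, here is a minimal sketch of this toy shutdown problem in Python. Everything in it is my own illustration: the names, the hypothetical value of $q$, and in particular the adjustment rule in `P_hat_prime`, which reads “altered in proportion to the expected change” literally with proportionality constant 1.

```python
# Toy shutdown problem: two possible human commands and the induced
# probabilities, as sketched above. Illustration only; the names and the
# adjustment rule in P_hat_prime are my assumptions, not the original
# framework's definitions.

O_S, O_P = "o_s", "o_p"   # observe shutdown command / paperclip command
U_S, U_P = "u_s", "u_p"   # shutdown utility / paperclip utility

q = 0.3  # hypothetical probability that the humans chose shutdown

def P(utility, observation):
    """P(u | h_{m-1} o): the observed command pins down the utility,
    for every history h_{m-1}."""
    if observation == O_S:
        return 1.0 if utility == U_S else 0.0
    if observation == O_P:
        return 1.0 if utility == U_P else 0.0
    raise ValueError("unexpected observation")

def P_hat(observation):
    """^P: probability of o_s vs o_p, conditional on the agent following pi_0."""
    return q if observation == O_S else 1.0 - q

def P_hat_prime(observation, action, pi_0_action, p_obs_if_action):
    """^P': equals ^P when the agent's action matches pi_0's.
    On a deviation, shift ^P by the expected change that the deviating
    action causes (one literal reading of the prose above)."""
    baseline = P_hat(observation)
    if action == pi_0_action:
        return baseline
    return baseline + (p_obs_if_action - baseline)

# Example: if some deviating action makes the shutdown command certain,
# ^P'(o_s) moves from q up to 1.
print(P_hat_prime(O_S, action="press_button", pi_0_action="wait",
                  p_obs_if_action=1.0))  # -> 1.0
```

This is only meant to pin down $P$ and $\hat{P}$ concretely enough that the algorithm’s $P'$ and $\hat{P}'$ can then be traced through by hand.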