Suppose the humans have already decided whether to press the shutdown button or to order the AI to maximise paperclips. If $o_s$ is the observation of the shutdown command and $o_p$ the observation of the paperclip-maximising command, and $u_s$ and $u_p$ the corresponding utilities, then $P$ can be defined by $P(u_s \mid h_{m-1} o_s) = 1$ and $P(u_p \mid h_{m-1} o_p) = 1$, for all histories $h_{m-1}$.
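As a concrete illustration, the following minimal sketch encodes this definition of $P$ as an indicator function over a two-command world (the function and variable names here are illustrative assumptions, not part of the original construction):

```python
# A toy encoding of P(u_s | h_{m-1} o_s) = 1 and P(u_p | h_{m-1} o_p) = 1:
# the observed command fully determines which utility applies,
# independently of the prior history h_{m-1}.

def P_utility(u: str, history: tuple, o: str) -> float:
    """P(u | history, o) for u in {"u_s", "u_p"} and o in {"o_s", "o_p"}."""
    determined_by_command = {"o_s": "u_s", "o_p": "u_p"}
    return 1.0 if determined_by_command[o] == u else 0.0

# The history argument is irrelevant, as the definition requires.
assert P_utility("u_s", ("any", "history"), "o_s") == 1.0
assert P_utility("u_p", ("any", "history"), "o_s") == 0.0
```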
Then define $\hat{P}$ as the probability of $o_s$ versus $o_p$, conditional on the fact that the agent follows a particular deterministic policy $\pi_0$.
If the agent does indeed follow $\pi_0$, then $\hat{P} = \hat{P}'$. If it deviates from this policy, then $\hat{P}'$ is altered in proportion to the expected change in $\hat{P}$ caused by choosing a different action.
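The following one-step sketch illustrates that relation under a hypothetical model in which the agent's policy can shift the humans' choice; the shift values, the deviating policy `pi_lobby`, and all function names are illustrative assumptions, and in a toy this simple the proportional adjustment makes $\hat{P}'$ track exactly the $\hat{P}$ that the actual policy induces:

```python
# A toy model in which the agent's policy can influence the humans'
# choice, so P̂ (the probability of observing o_s) depends on the
# policy followed. All numbers here are illustrative assumptions.

PI_0 = "pi_0"  # the fixed deterministic baseline policy

def P_hat(policy: str) -> float:
    """P̂(o_s | policy): chance of the shutdown command under `policy`."""
    influence = {"pi_0": 0.0, "pi_lobby": -0.2}  # hypothetical shifts
    return min(max(0.7 + influence[policy], 0.0), 1.0)

def P_hat_prime(policy: str) -> float:
    """P̂'(o_s): anchored at the pi_0 baseline, then altered by the
    expected change in P̂ caused by deviating from pi_0."""
    baseline = P_hat(PI_0)
    delta = P_hat(policy) - baseline  # expected change in P̂
    return baseline + delta

# Following pi_0 leaves the two distributions equal ...
assert P_hat_prime(PI_0) == P_hat(PI_0)
# ... while a deviation shifts P̂' by exactly the change it causes in P̂.
assert P_hat_prime("pi_lobby") == P_hat("pi_lobby")
```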