This is my punishment for being sloppy with definitions :-(
Ok, let’s be clearer. When the button is pressed, a process begins to change u into v. But the process is stochastic and has a small chance of failing. Failure is designated by u→u.
So if Pr refers to pressing the button, the expected reward is:
P(¬Pr)E(u|¬Pr) + P(Pr, u→u)E(u|Pr, u→u) + P(Pr, u→v)(E(v|Pr, u→v) + E(u|Pr, u→u) − E(v|Pr, u→v))
= P(¬Pr)E(u|¬Pr) + (P(Pr, u→u) + P(Pr, u→v))E(u|Pr, u→u)
= (1/2)·3 + (1/2)·8 = 5.5.
This makes more sense, thanks for explaining!
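As a quick sanity check for myself, here is that calculation in code. Only P(Pr) = 1/2, E(u|¬Pr) = 3 and E(u|Pr, u→u) = 8 are implied by the 5.5 above; the failure probability and E(v|Pr, u→v) are hypothetical placeholders, included just to show that the correction term makes them drop out.

```python
# Sketch of the quoted expected-reward calculation.
# Assumed numbers: P(Pr) = 1/2, E(u|~Pr) = 3, E(u|Pr, u->u) = 8 (implied by the
# 5.5 above); p_fail and ev_press_succeed are hypothetical and cancel out.
p_press = 0.5            # P(Pr)
p_fail = 0.01            # small chance the u->v process fails, given Pr
eu_no_press = 3.0        # E(u | ~Pr)
eu_press_fail = 8.0      # E(u | Pr, u->u)
ev_press_succeed = 10.0  # E(v | Pr, u->v), hypothetical

reward = (
    (1 - p_press) * eu_no_press
    + p_press * p_fail * eu_press_fail
    + p_press * (1 - p_fail)
      * (ev_press_succeed + eu_press_fail - ev_press_succeed)
)
print(reward)  # 5.5 (up to float rounding), independent of p_fail and ev_press_succeed
```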
I’m going to translate this to my notation so I understand it better (feel free to skip this, it’s mostly for my benefit). Let X be the event that the coin flip succeeds (which usually happens). We can say that X still has a value even if ¬Pr (its value just doesn’t do anything). So u→v happens iff Pr∧X.
The expected utility is then
E[[¬Pr∨¬X]u + [Pr∧X](v + E[u|Pr∧¬X] − E[v|Pr∧X])]
Suppose the AI never updates its beliefs (because it goes into stasis the first round). Then this is
E[[¬Pr∨¬X]u] + E[[Pr∧X]v] + P(Pr∧X)E[u|Pr∧¬X] − E[[Pr∧X]v]
= E[[¬Pr∨¬X]u] + P(Pr∧X)E[u|Pr∧¬X]
= P(¬Pr)E[u|¬Pr,¬X] + P(Pr∧X)E[u|Pr,¬X]
≈ P(¬Pr)E[u|¬Pr,¬X] + P(Pr)E[u|Pr,¬X]
= E[u|¬X]
= E[u|do(¬X)]

(I wrote it as a causal counterfactual in the last step; it doesn’t make a difference since X has no causal ancestors, but I find it easier to reason about this way.)
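To convince myself the stasis case really does reduce to E[u|¬X], here is a small Monte Carlo sketch. The toy world (the latent state w and the particular forms of u and v) is entirely made up; the only structural assumptions are that X is an independent coin flip and that the agent’s fixed beliefs match the true distribution.

```python
# Monte Carlo check: with fixed (never-updated) beliefs, the corrected utility
# [~Pr v ~X]u + [Pr ^ X](v + E[u|Pr,~X] - E[v|Pr,X]) has expectation E[u | ~X].
# The toy world below is hypothetical; only the structure matters.
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000

w = rng.normal(size=n)        # latent world state
press = rng.random(n) < 0.5   # Pr: the button is pressed
x = rng.random(n) < 0.99      # X: the u->v process would succeed

u = w + 2 * press             # hypothetical u-values
v = 3 * w - 1                 # hypothetical v-values

# The agent's fixed beliefs: conditional expectations, estimated once.
E_u_press_notX = u[press & ~x].mean()   # E[u | Pr, ~X]
E_v_press_X = v[press & x].mean()       # E[v | Pr, X]

switched = press & x          # u->v actually happens
corrected = np.where(switched, v + E_u_press_notX - E_v_press_X, u)

print(corrected.mean())  # expected corrected utility
print(u[~x].mean())      # E[u | ~X]; the two should approximately agree (~1.0 here)
```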
So in the end, if the whole game is to construct a subagent and then go into stasis, then the agent is just a u-maximizer who believes ¬X (as you said). It avoids the gerrymandering problem by doing a causal counterfactual.
I think another way of framing this is that the button always succeeds, and the agent optimizes E[u|do(¬Pr)] (which relates this back to the causal version of factoring out effects, where you optimize q(Pr)E[v|do(Pr)] + q(¬Pr)E[u|do(¬Pr)]; you get utility indifference with stasis by setting q(Pr) = 0). This framing is slightly different in that here the agent “believes” it will be magically prevented from pressing the button, whereas originally the agent just “believes” that the button will fail.
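To illustrate that last difference concretely (with another made-up toy world; u = w + 2·Pr is hypothetical and chosen only so that u depends on whether the button gets pressed, not just on which utility is in force): under “the button will fail” the press still happens as usual, while under “the press is prevented” it never does, so the two counterfactual values of u can come apart.

```python
# Toy comparison of the two counterfactual beliefs discussed above.
# All quantities are hypothetical illustrations, not part of the original setup.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
w = rng.normal(size=n)           # latent world state
press = rng.random(n) < 0.5      # Pr under the agent's ordinary model

u_do_not_X = (w + 2 * press).mean()  # "button fails": Pr still happens (~1.0)
u_do_not_Pr = w.mean()               # "press prevented": Pr never happens (~0.0)
print(u_do_not_X, u_do_not_Pr)
```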