I still think the original proposal has a gerrymandering problem (slightly different from the one in my comment).
Say the AI is going to construct a sub-agent B (and the original AI will then go to sleep for the rest of the game, unable to view anything that happens). In the first turn, B will get a number of resources between 1 and 10. These resources are only useful for making paperclips. In the second turn, B can press a paperclips-to-staples button, which changes the original AI’s utility function to staple-maximization. In the third turn, B may create a number of paperclips equal to the number of resources, and may independently create 42 staples.
The naive strategy is to create a subagent that doesn’t press the button, then makes paperclips. This policy gets an expected reward of 5.5 (the average number of resources).
Consider an alternative strategy. The AI creates a sub-agent B with the following policy: if B gets between 1 and 5 resources, then it presses the paperclips-to-staples button; otherwise it doesn’t press the button. Then B will create the appropriate number of paperclips, or 42 staples, depending on the button.
Let u be the number of paperclips produced and v the number of staples produced. Under this policy, we have E[u|u→u]=8 and E[v|u→v]=42. Since the button gets pressed half the time, the AI’s expected reward is
P(u→u)E[u|u→u] + P(u→v)(E[v|u→v] + E[u|u→u] − E[v|u→v])
= (1/2)⋅8 + (1/2)⋅(42 + 8 − 42)
= 8
Meanwhile, the AI’s expected reward if it never presses the button and just makes paperclips is 5.5. So it seems that the original proposal has a gerrymandering problem.
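For concreteness, here is a minimal sketch of this calculation in code (the function and policy names are mine, invented for illustration; only the game and the numbers come from the setup above):

```python
# Minimal sketch of the toy game above, assuming the button works with certainty.
# All names here are mine, invented for illustration; only the numbers come from the post.

def indifference_reward(policy):
    """Expected reward under the original indifference proposal:
    P(u->u)*E[u | u->u] + P(u->v)*(E[v | u->v] + E[u | u->u] - E[v | u->v]).
    policy(r) -> (press_button, paperclips_made, staples_made) for resource level r."""
    outcomes = [policy(r) for r in range(1, 11)]              # resources 1..10, uniform
    kept = [o for o in outcomes if not o[0]]                  # branches where u stays u
    flipped = [o for o in outcomes if o[0]]                   # branches where u becomes v
    p_flip = len(flipped) / len(outcomes)
    e_u = sum(c for _, c, _ in kept) / len(kept) if kept else 0.0          # E[u | u->u]
    e_v = sum(s for _, _, s in flipped) / len(flipped) if flipped else 0.0  # E[v | u->v]
    return (1 - p_flip) * e_u + p_flip * (e_v + e_u - e_v)

naive = lambda r: (False, r, 0)                                # never press; make r paperclips
gerrymandered = lambda r: (True, 0, 42) if r <= 5 else (False, r, 0)

print(indifference_reward(naive))          # 5.5
print(indifference_reward(gerrymandered))  # 8.0
```

The compensation term cancels v exactly, so the formula only ever “sees” E[u|u→u], and the gerrymandered policy has pushed that up from 5.5 to 8.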
It’s possible that your new proposal fixes this problem, but first I want to check that this is actually a problem for the original proposal.
This is my punishment for being sloppy with definitions :-(
Ok, let’s be clearer. When the button is pressed, a process begins to change u into v. But the process is stochastic, and has a small chance of failing. The failure of this process is designated by u→u.
So if Pr refers to pressing the button, the expected reward is:
P(¬Pr)E(u|¬Pr) + P(Pr, u→u)E(u|Pr, u→u) + P(Pr, u→v)(E(v|Pr, u→v) + E(u|Pr, u→u) − E(v|Pr, u→v))
= P(¬Pr)E(u|¬Pr) + (P(Pr, u→u) + P(Pr, u→v))E(u|Pr, u→u)
= (1/2)⋅8 + (1/2)⋅3
= 5.5.
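For concreteness, a quick numeric version of this corrected calculation (my own sketch; following the expectations above, it assumes B falls back to making its paperclips whenever the conversion fails):

```python
# Corrected reward with a stochastic button, assuming B makes its r paperclips
# whenever the conversion fails. The failure probability eps is arbitrary; it cancels out.
eps = 0.01                      # small failure probability of the u -> v process
p_press = 0.5                   # the policy presses on resources 1..5
e_u_no_press = 8.0              # E(u | ~Pr): mean of 6..10
e_u_press_fail = 3.0            # E(u | Pr, u->u): mean of 1..5
e_v_press_ok = 42.0             # E(v | Pr, u->v)

reward = ((1 - p_press) * e_u_no_press
          + p_press * eps * e_u_press_fail
          + p_press * (1 - eps) * (e_v_press_ok + e_u_press_fail - e_v_press_ok))
print(reward)                   # 5.5 for any eps -- the gerrymandering gain is gone
```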
This makes more sense, thanks for explaining!
I’m going to translate this to my notation so I understand it better (feel free to skip this, it’s mostly for my benefit). Let X be the event that the coin flip succeeds (which usually happens). We can say that X still has a value even if ¬Pr (its value just doesn’t do anything). So u→v happens iff Pr∧X.
The expected utility is then
E[[¬Pr∨¬X]u + [Pr∧X](v + E[u|Pr∧¬X] − E[v|Pr∧X])]
Suppose the AI never updates its beliefs (because it goes into stasis the first round). Then this is
E[[¬Pr∨¬X]u] + E[[Pr∧X]v] + P(Pr∧X)E[u|Pr∧¬X] − E[[Pr∧X]v]
= E[[¬Pr∨¬X]u] + P(Pr∧X)E[u|Pr∧¬X]
= P(¬Pr)E[u|¬Pr,¬X] + P(Pr∧X)E[u|Pr,¬X]
≈ P(¬Pr)E[u|¬Pr,¬X] + P(Pr)E[u|Pr,¬X]
= E[u|¬X]
= E[u|do(¬X)]
(I wrote it as a causal counterfactual in the last step; it doesn’t make a difference since X has no causal ancestors, but I find it easier to reason about this way.)
So in the end, if the whole game is to construct a subagent and then go into stasis, then the agent is just a u-maximizer who believes ¬X (as you said). It avoids the gerrymandering problem by doing a causal counterfactual.
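A quick Monte-Carlo sanity check of this reduction in the same toy game (again my own sketch, with the same assumption that B makes paperclips when the flip fails):

```python
# Monte-Carlo check that the expression above matches E[u | ~X] in the toy game
# (my own sketch; it assumes B falls back to making paperclips when X fails).
import random

def sample():
    r = random.randint(1, 10)           # resources
    press = r <= 5                       # the gerrymandered policy
    x = random.random() < 0.99           # X: the conversion would succeed
    if press and x:
        return press, x, 0, 42           # u -> v: 42 staples, no paperclips
    return press, x, r, 0                # otherwise: r paperclips

draws = [sample() for _ in range(500_000)]
pr_notx = [u for p, x, u, v in draws if p and not x]
pr_x_v = [v for p, x, u, v in draws if p and x]
e_u_pr_notx = sum(pr_notx) / max(1, len(pr_notx))     # E[u | Pr, ~X]
e_v_pr_x = sum(pr_x_v) / max(1, len(pr_x_v))          # E[v | Pr, X]

lhs = sum(v + e_u_pr_notx - e_v_pr_x if (p and x) else u
          for p, x, u, v in draws) / len(draws)
rhs = sum(u for p, x, u, v in draws if not x) / sum(1 for p, x, u, v in draws if not x)
print(lhs, rhs)                          # both come out near 5.5
```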
I think another way of framing this is that the button always succeeds, and the agent optimizes E[u|do(¬Pr)] (which relates this back to the causal version of factoring out effects, where you optimize q(Pr)E[v|do(Pr)] + q(¬Pr)E[u|do(¬Pr)]; you get utility indifference with stasis by setting q(Pr)=0). This will be slightly different in that here the agent “believes” it will be magically prevented from pressing the button, whereas originally the agent just “believes” that the button will fail.
Note there is one way in which the AI functions as an outcome pump: it will accept bets, at any odds, against ever becoming a v-maximiser. That’s because that’s what a pure u-maximiser would do, and this agent design behaves like a pure u-maximiser.
Typo: in the paragraph before the equation arrays, you forgot to change E[v|u→v] from 5 to 42 (you did so in the following equation arrays). This buffaloed me for a bit!
Fixed, thanks.