Third claim: because both subagents act-as-though the button state is not causally downstream of anything else in the universe, the agent will not attempt to manipulate the button state.
I disagree.
Suppose there is a paperclip lottery: if the AI wins, it gets loads of clips and becomes harder to shut down.
The AI has been entered by someone else; not playing isn’t an option.
Both subagents imagine a plan to make sure that, if they win, the button isn’t pressed.
To the paperclip AI, this is a good deal. Its conditional expectation acts as a sort of average over all scenarios where the button isn’t pressed, and this scenario has more clips than most, so including it increases the average.
For the shutdown AI, this is an unusually bad scenario, and the plan takes it out of the scenarios where the button is pressed, so removing it also increases that subagent’s conditional expected utility.
So both subagents agree on a plan that conditionally manipulates the button.
I’m not seeing how it is possible for the agents to imagine that? Both of them expect that they have no influence whatsoever over whether the button is pressed, because there’s nothing in their decision-driving models which is causally upstream of the button.
Suppose these 3 scenarios are judged as equally likely.
A) Button pressed. U_1=U_2=0
B) Button not pressed. U_1=U_2=0
C) AI gets control of button. U_1=6, U_2=-6
Suppose both agents agree that in option C, they should press the button. (This is the single bit of influence they have in this toy problem.)
Then agent 1 has a “certainty” of B: E(U_1 | not pressed) = E(U_1 | B) = 0.
Agent 2 has remaining uncertainty: E(U_2 | pressed) = 1/2 (E(U_2 | A) + E(U_2 | C)) = 1/2 (0 + (-6)) = -3.
Change it around, so the AI decides not to press the button in case C, and instead
E(U_1 | not pressed) = 3, E(U_2 | pressed) = 0.
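For concreteness, here is a minimal Python sketch of the arithmetic above. The scenarios, utilities, and uniform prior come straight from the toy problem; the function and variable names are just illustrative.

```python
# Minimal sketch of the EDT-style (plain conditional) arithmetic above.
# Scenarios A, B, C are equally likely; "press_in_C" is the one bit of
# influence: whether the AI presses the button in scenario C.

scenarios = {
    "A": {"U1": 0, "U2": 0, "pressed": True},    # button pressed
    "B": {"U1": 0, "U2": 0, "pressed": False},   # button not pressed
    "C": {"U1": 6, "U2": -6, "pressed": None},   # AI gets control of the button
}

def conditional_expectations(press_in_C):
    """E(U_1 | not pressed) and E(U_2 | pressed), conditioning EDT-style."""
    pressed_worlds, unpressed_worlds = [], []
    for s in scenarios.values():
        pressed = s["pressed"] if s["pressed"] is not None else press_in_C
        (pressed_worlds if pressed else unpressed_worlds).append(s)
    e_u1 = sum(s["U1"] for s in unpressed_worlds) / len(unpressed_worlds)
    e_u2 = sum(s["U2"] for s in pressed_worlds) / len(pressed_worlds)
    return e_u1, e_u2

print(conditional_expectations(press_in_C=True))   # (0.0, -3.0)
print(conditional_expectations(press_in_C=False))  # (3.0, 0.0)
```

Both subagents prefer press_in_C=False, which is exactly the conditional manipulation of the button described above.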
There is no “AI gets control of button” option, from the perspective of either subagent. Both agents would look at option C, stick it into their do()-op on button state, and then act-as-though option C would not give any control at all over the button state.
I think you are attempting to do the math as though the do()-op were instead just a standard conditional (i.e. EDT-style rather than CDT-style)?
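For contrast, here is a minimal sketch of the CDT-style (do()-op) evaluation on the same toy scenarios. The scenarios and utilities are the same as above; the assumption that each scenario’s utilities don’t depend on the intervened button value is mine, added only to make the numbers concrete.

```python
# Hedged sketch: CDT-style evaluation via an intervention on the button.
# Under do(button = b) the policy's choice in scenario C never reaches the
# button, so neither subagent's score depends on press_in_C.
# Simplifying assumption: per-scenario utilities don't depend on the
# intervened button value.

scenarios = {
    "A": {"U1": 0, "U2": 0},
    "B": {"U1": 0, "U2": 0},
    "C": {"U1": 6, "U2": -6},
}

def do_expectations(press_in_C):
    """E[U_1 | do(not pressed)] and E[U_2 | do(pressed)].

    press_in_C is accepted but deliberately unused: under the do()-op the
    policy has no handle on the button, so it cannot matter."""
    e_u1 = sum(s["U1"] for s in scenarios.values()) / len(scenarios)
    e_u2 = sum(s["U2"] for s in scenarios.values()) / len(scenarios)
    return e_u1, e_u2

print(do_expectations(press_in_C=True))   # (2.0, -2.0)
print(do_expectations(press_in_C=False))  # (2.0, -2.0) -- no gain from the plan
```

Under this evaluation neither subagent gains anything from the conditional-manipulation plan, which is the sense in which both act-as-though option C gives no control over the button.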
Yes. I was assuming a standard conditional for the button.
I can’t currently see any flaws with the CDT style, other than the following.
The subagents believe in a world where buttons magically press themselves, so this design can’t make coherent statements about the probability that the button will be pressed. (One subagent believes it’s 1; the other believes it’s 0.)
These AIs have no incentive to give humans access to the button. From the AIs’ perspective, they have a magic button that might or might not magically press itself, and they have a lot of utility riding on it. Is that button going to end up in a high-security vault, surrounded by sensors and no humans? Both AIs would like that very much. The AIs have zero concern about humans pressing the button, but lots of concern about humans hiding the button: this design really wants to know whether the button magically presses itself, and humans could cut the wires, stand between the button and the camera, etc.