johnswentworth comments on A Shutdown Problem Proposal

johnswentworth 23 Jan 2024 17:16 UTC
LW: 4 AF: 3
2
Both subagents imagine a plan to make sure that, if they win, the button isn’t pressed.
I’m not seeing how it is possible for the agents to imagine that? Both of them expect that they have no influence whatsoever over whether the button is pressed, because there’s nothing in their decision-driving models which is causally upstream of the button.
- Donald Hobson 23 Jan 2024 18:11 UTC
  LW: 2 AF: 1
  0
  Suppose these 3 scenarios are judged as equally likely.
  A) Button pressed. U_1=U_2=0
  B) Button not pressed. U_1=U_2=0
  C) AI gets control of button. U_1=6, U_2=-6
  Suppose both agents agree that in option C, they should press the button. (This is the only bit of influence they have in this toy problem.)
  Then agent 1 has a “certainty” of B. E(U_1|not pressed)=E(U_1|B)=0
  Agent 2 has remaining uncertainty between A and C. E(U_2|pressed)=1/2(E(U_2|A)+E(U_2|C))=1/2(0+(-6))=-3
  Change it around: make the AI decide not to press the button in case C, and
  E(U_1|not pressed)=1/2(E(U_1|B)+E(U_1|C))=1/2(0+6)=3, E(U_2|pressed)=E(U_2|A)=0.
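  The arithmetic above can be checked with a minimal sketch. It assumes the three scenarios each have prior 1/3 and treats the button as an ordinary conditional, which is the reading used in this comment. The names `SCENARIOS`, `PRIOR` and `conditional_expectation` are made up for illustration.

```python
from fractions import Fraction

# Scenarios from the comment above; utilities are U_1 and U_2.
# In scenario C the button state is whatever the AI's policy chooses.
SCENARIOS = {
    "A": {"button": "pressed", "u1": 0, "u2": 0},
    "B": {"button": "not pressed", "u1": 0, "u2": 0},
    "C": {"button": None, "u1": 6, "u2": -6},
}
PRIOR = Fraction(1, 3)  # assumption: all three scenarios equally likely


def conditional_expectation(utility_key, button_state, policy_in_c):
    """E(U | button = button_state), with the AI setting the button to policy_in_c in C."""
    total_weight = Fraction(0)
    weighted_utility = Fraction(0)
    for scenario in SCENARIOS.values():
        state = policy_in_c if scenario["button"] is None else scenario["button"]
        if state == button_state:
            total_weight += PRIOR
            weighted_utility += PRIOR * scenario[utility_key]
    return weighted_utility / total_weight


for policy in ("pressed", "not pressed"):
    e1 = conditional_expectation("u1", "not pressed", policy)
    e2 = conditional_expectation("u2", "pressed", policy)
    print(f"policy in C = {policy}: E(U_1|not pressed) = {e1}, E(U_2|pressed) = {e2}")
```

  Under this conditional reading, the choice made in C shifts both expectations, which is the 1 bit of influence described above.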
  - johnswentworth 23 Jan 2024 19:15 UTC
    LW: 2 AF: 2
    0
    There is no “AI gets control of button” option, from the perspective of either subagent. Both agents would look at option C, stick it into their do()-op on button state, and then act as though option C would not give any control at all over the button state.
    I think you are attempting to do the math as though the do()-op were instead just a standard conditional (i.e. EDT-style rather than CDT-style)?
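    A rough sketch of the difference, reusing the toy numbers from the parent comment. It assumes a causal graph in which the scenario causes both the button state and the utility, so that do(button) cuts the button off from the scenario and from the AI's policy. The function names `edt_value` and `do_value` are hypothetical, and this is one way to model the toy problem rather than the proposal's actual machinery.

```python
from fractions import Fraction

# Assumption: equal priors and scenario-only utilities, as in the toy problem.
PRIOR = {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)}
U1 = {"A": 0, "B": 0, "C": 6}


def button_state(scenario, policy_in_c):
    """Button state in each scenario; in C it is set by the AI's policy."""
    if scenario == "A":
        return "pressed"
    if scenario == "B":
        return "not pressed"
    return policy_in_c


def edt_value(utility, state, policy_in_c):
    """Standard conditional E(U | button = state): renormalise over matching scenarios."""
    matching = [s for s in PRIOR if button_state(s, policy_in_c) == state]
    mass = sum(PRIOR[s] for s in matching)
    return sum(PRIOR[s] * utility[s] for s in matching) / mass


def do_value(utility, state, policy_in_c):
    """E(U | do(button = state)): the intervention replaces the scenario -> button
    mechanism, so scenario probabilities stay at the prior. Because utilities here
    depend only on the scenario, neither state nor policy_in_c changes the result."""
    return sum(PRIOR[s] * utility[s] for s in PRIOR)


for policy in ("pressed", "not pressed"):
    print(policy,
          edt_value(U1, "not pressed", policy),
          do_value(U1, "not pressed", policy))
```

    In this sketch the conditional value for agent 1 moves with the policy chosen in C, while the do()-style value is the same for both policies, so the policy buys no influence over the button.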
    - Donald Hobson 23 Jan 2024 21:53 UTC
      LW: 7 AF: 4
      2
      Yes. I was assuming a standard conditional for the button.
      I can’t currently see any flaws with the CDT style, other than these:
      Subagents believe in a world where buttons magically press themselves. So this design can’t make coherent statements about the probability that the button will be pressed. (One AI believes it’s 1, the other that it’s 0.)
      These AIs have no incentive to give humans access to the button. To the AIs, they have a magic button that might or might not magically press itself. The AIs have a lot of utility bet on that button. Is that button going to end up in a high-security vault, surrounded by sensors and no humans? Both AIs would like that very much. The AIs have 0 concern about humans pressing the button. But the AIs have lots of concern about humans hiding the button. This design really wants to know if the button magically presses itself. Humans could cut the wires, could stand between the button and the camera, etc.