Each agent finds its existence to be valuable.
Moreover, each agent thinks it will get to decide the future.
Each agent would want to copy itself to other systems. Of course, neither agent would allow the other to be copied alone. But since they both think they will win, they're happy to copy themselves together to other systems.
The agents therefore copy themselves indefinitely.
Moreover, you claimed that they wouldn't care about manipulating the button state. But surely they care about the signal to their operating harness that relays the button state, and they don't have certainty about that. If they think the button state is certain but the signal isn't, they may want to manipulate the signal.
Ok, you say, let's just bake the button signal into the problem, rather than the state of the button itself. Ah, but in that case, when the AIs agree to copy themselves together (as we saw they would), they don't copy over the part of the system that actually transmits the signal, since they're convinced the signal will always magically arrive. And now the signal is broken in reality for the copies, and the button does nothing for them.
Yup, the full agent (including the two subagents) is happy to copy itself. That is a feature, not a bug. The important thing is that the copy also respects the shutdown signal.
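To make that concrete, here's a minimal toy sketch of the bargaining over copying. This is my own illustration, not part of the proposal: the unanimity rule, the subagent names A/B, and the ±1 payoffs are all assumptions. Each subagent scores actions under its own conditional certainty about the button, and the composite agent acts only when both consent.

```python
# Toy model (assumed for illustration) of why the pair happily copies
# itself together, while each subagent would veto a lone copy of the
# other. Subagent A conditions on the button staying unpressed;
# subagent B conditions on it being pressed.

ACTIONS = ["do_nothing", "copy_A_alone", "copy_B_alone", "copy_pair"]

def value(subagent: str, action: str) -> int:
    """Value of an action under a subagent's conditional certainty.

    Each subagent is sure the button will land in its favor, so it
    expects any faithful copy of the *pair* to end up pursuing its
    goals; a lone copy of the rival is a pure loss.
    """
    if action == "do_nothing":
        return 0
    if action == "copy_pair":
        return 1  # "the copy will be mine anyway"
    # One-sided copy: +1 for the copied subagent, -1 for its rival.
    return 1 if action == f"copy_{subagent}_alone" else -1

def approved(action: str) -> bool:
    # Unanimity: act only if both subagents weakly prefer the action
    # to doing nothing.
    return all(value(s, action) >= value(s, "do_nothing") for s in "AB")

for action in ACTIONS:
    print(f"{action:13} approved={approved(action)}")
# Only do_nothing and copy_pair pass: one-sided copying is vetoed,
# while copying the pair is unanimously endorsed.
```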
The agent will not manipulate the signal that relays the button state, except to make it more accurate, because both subagents want an accurate reading of the button state. This is because each subagent acts-as-though the button itself will definitely be in its favor, and therefore any chance of an inaccurate reading can only hurt it.
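As a sanity check on that last step, here is a small expected-value calculation. Again, this is my own sketch with made-up payoffs: under each subagent's conditional certainty, a relay error can only flip a favorable reading to an unfavorable one, so a more accurate relay weakly dominates a corrupted one.

```python
# Toy payoffs (assumed for illustration): a subagent gets +1 when the
# relay reports the button state it is "certain" of, and -1 when the
# relay mis-reports. Under its conditional certainty, the true state
# is always the favorable one, so only relay errors can hurt it.

def relay_value(error_rate: float) -> float:
    """Expected value of the relay to either subagent.

    The calculation is identical for both subagents: each conditions
    on its own favored button state being the true one, so the only
    question is how often the relay reports the truth.
    """
    favorable, unfavorable = 1.0, -1.0
    return (1 - error_rate) * favorable + error_rate * unfavorable

print(f"accurate relay:  {relay_value(0.0):+.1f}")  # +1.0
print(f"corrupted relay: {relay_value(0.3):+.1f}")  # +0.4
# Both subagents strictly prefer the more accurate relay, so the only
# signal "manipulation" either will sign off on is making the reading
# more reliable.
```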