To clarify:
Subagent2 assumes the shutdown button will be pressed independently of upstream events, i.e. also independently of whether the shutdown button got removed. So it would not veto the removal of the shutdown button (right?). The hope is that subagent1 doesn’t bother to remove the shutdown button because it believes the button won’t be pressed anyway.
(The usefulness of subagent2 comes from preventing the creation of pure u1 maximizers; it does not help protect the shutdown button (right?).)
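For concreteness, here is a toy sketch of how I'm picturing the structure; the plan/outcome tables, the utility numbers, and the exact "both must weakly prefer it to a do-nothing baseline" veto rule are my own illustrative guesses, not taken from the post:

```python
# Toy sketch (my own construction, not from the post) of how I'm picturing it.
# A "plan" is just a table: for each *forced* button state, a distribution over
# outcomes. Evaluating under do(button = x) means looking up the forced state x
# directly, ignoring whatever the plan itself would do to the button.
plans = {
    "do_nothing":        {"pressed": {"shutdown": 1.0}, "unpressed": {"status_quo": 1.0}},
    "pursue_u1":         {"pressed": {"shutdown": 1.0}, "unpressed": {"u1_good": 1.0}},
    "remove_button":     {"pressed": {"shutdown": 1.0}, "unpressed": {"u1_good": 1.0}},
    "pure_u1_maximizer": {"pressed": {"u1_good": 1.0},  "unpressed": {"u1_good": 1.0}},
}

# u1 cares about the normal goal; u2 cares about shutting down once pressed.
u1 = {"shutdown": 0.0, "status_quo": 0.5, "u1_good": 1.0}
u2 = {"shutdown": 1.0, "status_quo": 0.0, "u1_good": 0.0}

def eu(plan, utility, forced_button):
    return sum(p * utility[o] for o, p in plans[plan][forced_button].items())

def approved(plan, baseline="do_nothing"):
    # Subagent1 evaluates under do(button = unpressed), subagent2 under
    # do(button = pressed); either one can veto a plan it dislikes vs. the baseline.
    return (eu(plan, u1, "unpressed") >= eu(baseline, u1, "unpressed")
            and eu(plan, u2, "pressed") >= eu(baseline, u2, "pressed"))

for plan in plans:
    print(plan, approved(plan))
# -> remove_button is approved (subagent2's do(pressed) evaluation is unaffected by it),
#    but pure_u1_maximizer is vetoed by subagent2.
```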
Question 1:
I may be wrong about this, but iirc MIRI considered a case where the agent assigned probability 0 to the shutdown button being pressed. Iirc the problem was that the agent would use the button as an outcome pump: it would create mechanisms that cause the button to be pressed if something goes worse than expected, thereby magically decreasing the probability that something goes badly, since (by assumption) the button cannot be pressed.
I haven’t thought much about it, but doesn’t this proposal have the same failure mode? (And if not, why not?)
Question 2: (Feel free to ignore this question if the proposal does fall for the failure mode in Question 1.)
Is there a probabilistic extension of the proposal which uses 99.99% instead of 100% but still basically works? (Maybe assuming away some problems with comparing utility functions or so.) And if not: why not / what is the part that relies on certainty?
Subagent2 assumes the shutdown button will be pressed independently of upstream events, i.e. also independently of whether the shutdown button got removed. So it would not veto the removal of the shutdown button (right?). The hope is that subagent1 doesn’t bother to remove the shutdown button because it believes the button won’t be pressed anyway.
(The usefulness of subagent2 comes from preventing the creation of pure u1 maximizers; it does not help protect the shutdown button (right?).)
That’s correct.
I may be wrong about this, but iirc MIRI considered a case where the agent assigned probability 0 to the shutdown button being pressed. Iirc the problem was that the agent would use the button as an outcome pump: it would create mechanisms that cause the button to be pressed if something goes worse than expected, thereby magically decreasing the probability that something goes badly, since (by assumption) the button cannot be pressed.
I haven’t thought much about it, but doesn’t this proposal have the same failure mode? (And if not, why not?)
You remember correctly. This proposal avoids that failure mode via the do()-ops. (More generally, that failure mode is an evidential-decision-theory sort of problem, which can be avoided by using causal-style decision theories, as this proposal does.) In particular, note that such outcome pumps require “mechanisms which would cause the button to be pressed”; because of the do()-ops, the agent acts-as-though there is no mechanism which would cause the button to be pressed.
Of course that means this proposal likely has the usual failure modes of causal decision theories (e.g. in Newcomb-style problems); presumably the way to handle that is via a functional-decision-theory style approach, i.e. counterfacting over all simulated instances of the shutdown button.
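To make the difference concrete, here's a toy numerical sketch (my own numbers and framing, purely illustrative): an agent with a hard-coded belief that the button can't be pressed considers building a "tripwire" mechanism that presses the button whenever the project goes badly. Conditioning on the belief makes the tripwire look like an outcome pump; intervening via do() does not.

```python
# Purely illustrative numbers; "tripwire" = a mechanism that presses the button
# whenever the project goes badly.
p_fail = 0.3  # probability the project goes badly, independent of the button

def value_conditioning(build_tripwire):
    # EDT-ish evaluation with the hard-coded belief P(pressed) = 0: the agent
    # conditions on "not pressed". With the tripwire in place, "goes badly AND
    # not pressed" is impossible, so conditioning rules out the bad outcome --
    # the tripwire looks like magic.
    p_bad = 0.0 if build_tripwire else p_fail
    return 1.0 - p_bad  # utility 1 if things go well, 0 if they go badly

def value_do(build_tripwire):
    # CDT-ish evaluation under do(button = not pressed): the intervention severs
    # the tripwire's connection to the button, so the failure probability is
    # untouched and the tripwire buys nothing.
    return 1.0 - p_fail

print(value_conditioning(True), value_conditioning(False))  # 1.0 0.7 -> build it
print(value_do(True), value_do(False))                      # 0.7 0.7 -> indifferent
```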
Thanks.
I briefly looked into the MIRI paper (and the section of Eliezer’s lecture that starts at 22min) again.
My main guess now is that you’re not supposed to make the agent act on false beliefs; one of the desiderata of the shutdown problem is that your agent shouldn’t suffer a capability penalty, including taking bad actions based on false beliefs (like taking bad bets, as jessicata mentioned). Your proposal violates that.
The approach MIRI considered wasn’t to modify the world model so the agent believes the button cannot be pressed, but instead to construct the utility function in a way that makes the resulting behavior equivalent, without the agent having wrong beliefs.
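To make the “bad bets” point concrete, here's a toy calculation with made-up numbers (the 20% press probability and the payoffs are just illustrative): an agent whose world model says the button cannot be pressed will happily take a bet that pays a little if the button stays unpressed and loses a lot if it gets pressed.

```python
# Made-up numbers, just to illustrate the "bad bets" capability penalty.
p_press_true = 0.2   # actual probability that the button gets pressed
p_press_agent = 0.0  # what the belief-modified agent thinks

payoff_unpressed = +1    # the bet pays a little if the button stays unpressed...
payoff_pressed = -100    # ...and loses a lot if it gets pressed

# The belief-modified agent values the bet at +1 and takes it...
ev_believed = (1 - p_press_agent) * payoff_unpressed + p_press_agent * payoff_pressed
# ...even though its actual expected payoff is strongly negative.
ev_actual = (1 - p_press_true) * payoff_unpressed + p_press_true * payoff_pressed

print(ev_believed)  # 1.0
print(ev_actual)    # -19.2
```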