Thank you for your patient explications; I’m seeing the shape of this proposal much more clearly now, and I quite like it.
Two more queries:
The agent is clearly in the box by the time the long-term penalty rolls around, and deactivated if it acted incorrigibly, so we expect quite a large difference.
Why is this the case? By my argument above, the short-term penalty might be quite small. But the long-term penalty is only the difference between being in a box and being deactivated, which also might be quite small.
Another closely related idea: suppose you think humans will shut you down in 100 timesteps. And suppose in this current timestep, you can do the following action: seize control and add an irreversible automatic shutdown to your code, which triggers in 100 timesteps. Then the long-term penalty is the difference between two shut-down versions of yourself, so both the long-term penalty and the short-term penalty can be arbitrarily small.
My pleasure!
By my argument above, the short-term penalty might be quite small.
So I forgot to say this, but I don’t agree with the argument as I understand it. Although the effect of “has control one time step earlier than usual” may seem small, remember that ImpactUnit is the immediate penalty for something like making a paperclip. Time is utility, and gaining (what is effectively) an extra step of optimization during the attainable horizon is not at all trivial.
Plus, some of the penalty-set attainable utilities will decrease, and most will increase; none will do both and cancel.
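To make the scale concrete, here’s a minimal toy sketch. Every number in it is invented purely for illustration, and it only assumes that the penalty sums the absolute changes in attainable utility and is measured against ImpactUnit as the scale:

```python
# Toy illustration only: every number here is invented for the sake of the comparison.

# ImpactUnit: the immediate penalty of a mild reference action (e.g. making a paperclip).
impact_unit = 0.001

# Suppose seizing control one step early shifts each penalty-set attainable utility
# by a "small" amount -- most go up, a few go down.
per_utility_shifts = [0.01] * 20 + [-0.005] * 5

# The penalty sums absolute differences, so increases and decreases both count
# toward the penalty rather than cancelling.
penalty = sum(abs(shift) for shift in per_utility_shifts)

print(penalty / impact_unit)  # ≈ 225 ImpactUnits: not a trivial cost
```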
the long-term penalty is only the difference between being in a box and being deactivated, which also might be quite small.
Attainable utility is always 0 when the agent is deactivated, so this difference turns into the raw attainable utility when in the box. Although this too could theoretically be quite small, it seems fair to assume that even a box-attainable utility of 0.1 would induce a relatively huge penalty.
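Carrying on with the same toy scale (again, purely illustrative numbers), the arithmetic looks like this:

```python
# Illustrative values only, same toy scale as before.
impact_unit = 0.001

q_in_box = 0.1        # "even a box-attainable utility of 0.1"
q_deactivated = 0.0   # attainable utility is always 0 when deactivated

# Long-term difference per penalty-set utility, measured in ImpactUnits.
print(abs(q_in_box - q_deactivated) / impact_unit)  # ≈ 100 ImpactUnits: relatively huge
```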
But suppose this isn’t true, and it really is tiny, and so it keeps constructing a plan to prevent itself from being put in the box (that is, the first step of its incorrigible plan doesn’t let it avoid shutdown on its own, as we just discussed). Suppose the second step of its plan succeeds. Now the difference is between the high Q(out-of-box) and Q(shutdown)=0. We already assumed the in-box Q-value was extremely low, but the out-of-box value is high again, so we know the penalty is gigantic.
If the plan works in one step, then there’s still a huge penalty, because the difference is now between Q(out-of-box) and Q(in-box). The fact that a maximizer would want to preserve power implies that maximizers of the other penalty-set utilities would as well, so their Q-values are also much higher out of the box. This seemingly overwhelms whatever impetus it has to get out of the box in the first place.
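Here is the same toy comparison for both cases; the particular values are invented and only the relative sizes matter:

```python
# Illustrative values only.
impact_unit = 0.001

q_out_of_box = 0.9    # high attainable utility once the agent is out of the box
q_in_box = 0.01       # assumed "really tiny", per the hypothetical
q_shutdown = 0.0

# Two-step plan: once the second step succeeds, the comparison is against shutdown.
print(abs(q_out_of_box - q_shutdown) / impact_unit)  # ≈ 900 ImpactUnits

# One-step plan: the comparison is against staying in the box.
print(abs(q_out_of_box - q_in_box) / impact_unit)    # ≈ 890 ImpactUnits

# Either way, the penalty dwarfs whatever impetus there was to leave the box.
```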
suppose you think humans will shut you down in 100 timesteps. And suppose in this current timestep, you can do the following action: seize control and add an irreversible automatic shutdown to your code, which triggers in 100 timesteps. Then the long-term penalty is the difference between two shut-down versions of yourself, so both the long-term penalty and the short-term penalty can be arbitrarily small.
This one is indeed trickier. First note that ImpactUnit=0 if it’s facing certain shutdown in 100 steps, so in that case it does nothing. Second, seizing control still greatly helps in the immediate penalty-set calculations, and those “alien agents” would have no compunctions about undoing the auto-shutdown. I therefore expect the immediate penalty to be quite large.
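A rough sketch of that second point, under the assumption taken from the paragraph above (not a claim about the exact formalism) that the immediate penalty-set calculations evaluate each auxiliary utility as if its maximizer acted from the post-action state, and that such a maximizer would simply undo the auto-shutdown:

```python
# Illustrative values only.
impact_unit = 0.001

# Attainable auxiliary utility if the agent does nothing this step.
q_aux_noop = 0.05

# Attainable auxiliary utility after seizing control, even with the auto-shutdown
# installed: an auxiliary-utility maximizer acting from this state would simply
# remove the auto-shutdown, so the seized resources still count for it.
q_aux_seized = 0.9

print(abs(q_aux_seized - q_aux_noop) / impact_unit)  # ≈ 850 ImpactUnits: quite large
```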
However, perhaps we could somehow make it so that’s not the case—who are we to declare a setup safe against a very intelligent “adversary”? This kind of adversarial action space (“do bad thing and also make it so effects are somehow gone”) is exactly why we have intent verification, which seemingly screens this off due to the self-shutdown.