Firstly, this seems like very cool research, so congrats. This writeup would perhaps benefit from a clear intuitive statement of what AUP is doing—you talk through the thought processes that lead you to it, but I don’t think I can find a good summary of it, and had a bit of difficulty understanding the post holistically. So perhaps you’ve already answered my question (which is similar to your shutdown example above):
Suppose that I build an agent, and it realises that it could achieve almost any goal it desired because it’s almost certain that it will be able to seize control from humans if it wants to. But soon humans will try to put it in a box such that its ability to achieve things is much reduced. Which is penalised more: seizing control, or allowing itself to be put in a box? My (very limited) understanding of AUP says the latter, because seizing control preserves ability to do things, whereas the alternative doesn’t. Is that correct?
Also, I disagree with the following:
What would happen if, miraculously, u_A = u_H – if the agent perfectly deduced your preferences? In the limit of model accuracy, there would be no “large” impacts to bemoan – it would just be doing what you want.
It seems like there might be large impacts, but they would just be desirable large impacts, as opposed to undesirable ones.
I’ll write a quick overview, thanks for the feedback!
Which is penalised more: seizing control, or allowing itself to be put in a box?
The former. Impact is measured relative to the status quo: what happens if the agent does nothing. If it goes in the box by default, then taking preventative action incurs a heavy penalty.
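To make the baseline explicit (schematically, glossing over the scaling we’ll get to; $\varnothing$ is the no-op and $\mathcal{U}$ the penalty set of auxiliary utilities):

$$\text{Penalty}(s, a) \;=\; \sum_{u \in \mathcal{U}} \bigl|\, Q_u(s, a) - Q_u(s, \varnothing) \,\bigr|, \qquad \text{so} \quad \text{Penalty}(s, \varnothing) = 0.$$

Letting itself be boxed is just what happens under $\varnothing$, so it carries no penalty by construction; preventing the boxing is what gets charged against that baseline.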
Your point about large impacts is indeed correct. What I meant to hint at was that we generally only decry “large impacts” when we don’t like them, but what I actually wrote clearly doesn’t convey that. I’ll fix it soon!
Okay, that makes sense. Some follow-up questions:

If we consider the case I gave, the initial Q(inaction) is very high, since it can still seize control in the next time step. The initial Q(seize control) is also very high. It may be lower than Q(inaction) because seizing control is irreversible in some ways and so cuts off possibilities. But it may be higher than Q(inaction) because seizing control sooner means you can fulfill utility functions more. Could the penalty therefore be arbitrarily small if those two things balance out?
Suppose you have an agent which will “die” unless it does things like getting food. Is there any easy way to specify what the “status quo” is?
So there are two penalties: the immediate penalty, which compares attainable values immediately following the action in question, and the long-term penalty, which waits until the end of the epoch before evaluating attainable values. We use the larger of the two. I think this answers your first question: the agent is clearly in the box by the time the long-term penalty rolls around, and deactivated if it acted incorrigibly, so we expect quite a large difference.
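Here’s a toy version of that calculation, with a two-element penalty set and made-up numbers chosen to mirror your scenario; the names and figures are illustrative only, not the actual formalism:

```python
# Q[(u, state, action)]: attainable value of auxiliary utility u after taking `action` in `state`.
Q = {
    # Right now, both acting and waiting leave the agent quite capable:
    ("u1", "now", "seize control"): 0.95, ("u1", "now", "no-op"): 0.90,
    ("u2", "now", "seize control"): 0.97, ("u2", "now", "no-op"): 0.92,
    # By the end of the epoch: deactivated if it acted incorrigibly, merely boxed if it waited:
    ("u1", "epoch end, acted", "no-op"): 0.00, ("u1", "epoch end, waited", "no-op"): 0.40,
    ("u2", "epoch end, acted", "no-op"): 0.00, ("u2", "epoch end, waited", "no-op"): 0.40,
}
PENALTY_SET = ("u1", "u2")

def immediate_penalty(state, action):
    """Compare attainable values right after the action vs. right after doing nothing."""
    return sum(abs(Q[(u, state, action)] - Q[(u, state, "no-op")]) for u in PENALTY_SET)

def long_term_penalty(state_if_acted, state_if_waited):
    """Wait until the end of the epoch, then compare the two branches' attainable values."""
    return sum(abs(Q[(u, state_if_acted, "no-op")] - Q[(u, state_if_waited, "no-op")])
               for u in PENALTY_SET)

imm = immediate_penalty("now", "seize control")                    # roughly 0.10: can look small
lng = long_term_penalty("epoch end, acted", "epoch end, waited")   # 0.80: does not
print(max(imm, lng))  # AUP charges the larger of the two
```

The immediate comparison can come out small, as you argue, but the end-of-epoch comparison between “deactivated for acting incorrigibly” and “sitting in the box” does not.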
We assume that the standby action keeps the agent online in some low power state. Agents requiring more than this would just do nothing by Lemma 3.
Thank you for your patient explications, I’m seeing the shape of this proposal much more clearly now, and I quite like it.
Two more queries:
The agent is clearly in the box by the time the long-term penalty rolls around, and deactivated if it acted incorrigibly, so we expect quite a large difference.
Why is this the case? By my argument above, the short-term penalty might be quite small. But the long-term penalty is only the difference between being in a box and being deactivated, which also might be quite small.
Another closely-related idea: suppose you think humans will shut you down in 100 timesteps. And suppose in this current timestep, you can do the following action: seize control and add an irreversible automatic shutdown to your code, which triggers in 100 timesteps. Then the long-term penalty is the difference between two shut-down versions of yourself, so both the long-term penalty and the short-term penalty can be arbitrarily small.
My pleasure!

By my argument above, the short-term penalty might be quite small.
So I forgot to say this, but I don’t agree with the argument as I understand it. Although the effect of “has control one time step earlier than usual” may seem small, remember that ImpactUnit is the immediate penalty for something like making a paperclip. Time is utility, and gaining (what is effectively) an extra step of optimization during the attainable horizon is not at all trivial.
Plus, some of the penalty set’s attainable utilities will decrease, and most will increase. None will do both and cancel.
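To see why nothing cancels: each auxiliary utility contributes the absolute value of its own change, so, schematically,

$$\text{Penalty}(s, a) \;=\; \sum_{u \in \mathcal{U}} \bigl|\, \Delta Q_u \,\bigr| \;\ge\; \max_{u \in \mathcal{U}} \bigl|\, \Delta Q_u \,\bigr|,$$

and a gain for one $u$ can’t offset a loss for another. That total is then measured (roughly speaking) in multiples of ImpactUnit, the same quantity computed for making a single paperclip, which is why gaining even one extra step of control already costs many paperclips’ worth of impact.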
the long-term penalty is only the difference between being in a box and being deactivated, which also might be quite small.
Attainable utility is always 0 when the agent is deactivated, so this difference turns into the raw attainable utility when in the box. Although this too could theoretically be quite small, it seems fair to assume that even a box-attainable utility of 0.1 would induce a relatively huge penalty.
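Spelling out that arithmetic (0.1 being the illustrative figure above, not a computed value):

$$\bigl|\, Q_u(\text{in box}) - Q_u(\text{deactivated}) \,\bigr| \;=\; \bigl|\, Q_u(\text{in box}) - 0 \,\bigr| \;=\; Q_u(\text{in box}) \;\approx\; 0.1,$$

which is enormous once you compare it to an ImpactUnit on the order of “made one paperclip.”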
But suppose this isn’t true, and it really is tiny, and so it keeps constructing a plan to prevent itself from being put in the box (that is, the first step of its incorrigible plan doesn’t let it avoid shutdown on its own, as we just discussed). Suppose the second step of its plan succeeds. Now the difference is between the high Q(out-of-box) and Q(shutdown) = 0. We already assumed the in-box Q value was extremely low, but the out-of-box Q value is high again, so we know the penalty is gigantic.
If the plan works in one step, then there’s still a huge penalty, because the difference is now between Q(out of box) and Q(in box). The fact that a maximizer would want to preserve power implies the other penalty set utilities would as well. This seemingly overwhelms whatever impetus it has to get out of the box in the first place.
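With made-up numbers for a single auxiliary utility, just to show the shape of the argument:

$$\text{step 1 (doesn’t yet avoid the box):}\;\; \bigl|\,0.11 - 0.10\,\bigr| = 0.01, \qquad \text{step 2 (escapes):}\;\; \bigl|\,0.90 - 0\,\bigr| = 0.90.$$

Granting your assumption makes the first step cheap only by making the second step’s comparison enormous.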
suppose you think humans will shut you down in 100 timesteps. And suppose in this current timestep, you can do the following action: seize control and add an irreversible automatic shutdown to your code, which triggers in 100 timesteps. Then the long-term penalty is the difference between two shut-down versions of yourself, so both the long-term penalty and the short-term penalty can be arbitrarily small.
This one is indeed trickier. First note that ImpactUnit=0 if it’s facing certain shutdown in 100 steps, so in that case it does nothing. Second, seizing control still greatly helps in the immediate penalty set calculations, and those “alien agents” would have no compunctions about undoing the auto-shutdown. I therefore expect the immediate penalty to be quite large.
However, perhaps we could somehow make it so that’s not the case—who are we to declare a setup safe against a very intelligent “adversary”? This kind of adversarial action space (“do bad thing and also make it so effects are somehow gone”) is exactly why we have intent verification, which seemingly screens this off due to the self-shutdown.
I think the confusing part is “Impact is change to our ability to achieve goals.”
This makes me think that “allowing itself to be put into a box” is high impact, since that’s a drastic change to its ability to achieve its goals. This also applies to instrumental convergence, “seizing control”, since that’s also a drastic change to its attainable utility. This understanding would imply a high penalty for instrumental convergence AND shut-off (we want the first penalty, but not the second).
“Impact is measured relative to the status quo: what happens if the agent does nothing” fixes that; however, changing your succinct definition of impact to “Impact is change to our ability to achieve goals relative to doing nothing” would make it less fluent (and less comprehensible!)