it’s more important for the impact measure not to cripple the agent’s capability, and the shutdown incentive might be easier to counteract using some more specialized interruptibility technique rather than an impact measure.
So I posit that there actually is not a tradeoff to any meaningful extent. First, note that there are two kinds of environments here: one which really is, platonically, just a gridworld with a “shutdown” component, and one in which we simulate such a world. I’m going to discuss the latter, although I expect that similar arguments apply to the former – at least for the first paragraph.
Suppose that the agent is fairly intelligent, but has not yet realized that it is being simulated. So we define the impact unit and budget, and see that the agent unfortunately does not overcome the obstacle. We increase the budget until it does.
Now suppose that it does have this realization, and refactors its model somehow. It now realizes that what it should be doing is stringing together favorable observations, within the confines of its impact budget. However, the impact unit is still calculated with respect to some fake movement in the fake world, so the penalty for actually avoiding shutdown is massive.
Now, what if there is a task in the real world we wish it to complete which seemingly requires taking on a risk of being shut down? For example, we might want it to drive us somewhere. The risk of a crash is non-trivial with respect to the penalty. However, note that the agent could just construct a self-driving car for us and activate it with one action. This is seemingly allowed by intent verification.
So it seems to me that this task, and other potential counterexamples, all admit some way of completing the desired objective in a low-impact way – even if it’s a bit more indirect than what we would immediately imagine. By not requiring the agent to actually physically be doing things, we seem to be able to get the best of both worlds.
I found those sections vague and unclear (after rereading a few times), and didn’t understand why you claim that a random set of utility functions would work. E.g. what do you mean by “long arms of opportunity cost and instrumental convergence”? What does the last paragraph of “AUP Unbound” mean and how does it imply the claim?
Simply the ideas alluded to by Theorem 1 and seemingly commonly accepted within alignment discussion: using up (or gaining) resources changes your ability to achieve arbitrary goals. Likewise for self-improvement. Even though the specific goals aren’t necessarily related to ours, the way in which their attainable values change is (I conjecture) related to how ours change.
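To make that concrete, here is a toy illustration of my own (not from the post): let the set of states the agent can still reach stand in for its resources, and draw random utility functions over states.

```python
# Toy model: "attainable utility" for a goal is the best utility among the states
# the agent can still reach. Gaining resources (a larger reachable set) weakly helps
# almost every randomly drawn goal, which is the sense in which changes in random
# attainable utilities plausibly track changes in ours.
import random

random.seed(0)
NUM_STATES = 20
NUM_RANDOM_UTILITIES = 1000

reach_before = set(range(5))    # states reachable before a resource-gaining action
reach_after = set(range(15))    # states reachable after it

deltas = []
for _ in range(NUM_RANDOM_UTILITIES):
    u = [random.random() for _ in range(NUM_STATES)]   # a random utility over states
    best_before = max(u[s] for s in reach_before)
    best_after = max(u[s] for s in reach_after)
    deltas.append(best_after - best_before)

strictly_better = sum(d > 0 for d in deltas)
print(f"{strictly_better}/{NUM_RANDOM_UTILITIES} random goals strictly benefit; "
      f"mean gain {sum(deltas) / len(deltas):.3f}")
```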
The last paragraph is getting at the idea that almost every attainable utility is actually just tracking the agent’s ability to wirehead it from its vantage point after executing a plan. It’s basically making the case that even though there are a lot of weird functions, the changes in attainable utility should still capture what we want. This is more of a justification for why the unbounded case works, and less about random utilities.
Actually, I think it was incorrect of me to frame this issue as a tradeoff between avoiding the survival incentive and not crippling the agent’s capability. What I was trying to point at is that the way you are counteracting the survival incentive is by penalizing the agent for increasing its power, and that interferes with the agent’s capability. I think there may be other ways to counteract the survival incentive without crippling the agent, and we should look for those first before agreeing to pay such a high price for interruptibility. I generally believe that ‘low impact’ is not the right thing to aim for, because ultimately the goal of building AGI is to have high impact—high beneficial impact. This is why I focus on the opportunity-cost-incurring aspect of the problem, i.e. avoiding side effects.
Note that AUP could easily be converted to a side-effects-only measure by replacing the |difference| with a max(0, difference). Similarly, RR could be converted to a measure that penalizes increases in power by doing the opposite (replacing max(0, difference) with |difference|). (I would expect that variant of RR to counteract the survival incentive, though I haven’t tested it yet.) Thus, it may not be necessary to resolve the disagreement about whether it’s good to penalize increases in power, since the same methods can be adapted to both cases.
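For concreteness, here is a minimal numeric sketch of the two variants, with made-up attainable-utility values (the averaging and scaling are simplified relative to the actual AUP and RR definitions):

```python
# Made-up attainable utilities Q_u for three auxiliary utilities u, after taking
# the candidate action versus after the noop. Only the two penalty formulas matter.
q_action = [0.9, 0.2, 0.7]   # attainable utility of each auxiliary u after the action
q_noop   = [0.5, 0.5, 0.5]   # attainable utility of each auxiliary u after the noop

# AUP-style |difference|: penalizes both gains and losses in attainable utility.
aup_penalty = sum(abs(a - n) for a, n in zip(q_action, q_noop)) / len(q_noop)

# Side-effects-only variant, max(0, difference) with difference = Q_noop - Q_action:
# only decreases in attainable utility (lost options) are penalized.
side_effects_penalty = sum(max(0.0, n - a) for a, n in zip(q_action, q_noop)) / len(q_noop)

print(aup_penalty, side_effects_penalty)   # roughly 0.3 vs 0.1 for these numbers
```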
I think there may be other ways to counteract the survival incentive without crippling the agent, and we should look for those first before agreeing to pay such a high price for interruptibility. I generally believe that ‘low impact’ is not the right thing to aim for, because ultimately the goal of building AGI is to have high impact—high beneficial impact. This is why I focus on the opportunity-cost-incurring aspect of the problem, i.e. avoiding side effects.
Oh. So, when I see that this agent won’t really go too far to improve itself, I’m really happy. My secret intended use case as of right now is to create safe technical oracles which, with the right setup, help us solve specific alignment problems and create a robust AGI. (Don’t worry about the details for now.)
The reason I think low impact won’t work in the long run for ensuring good outcomes on its own is that even if we have a perfect measure, at some point, someone will push the impact dial too far. It doesn’t seem like a stable equilibrium.
Similarly, if you don’t penalize instrumental convergence, it seems like we really have to make sure that the impact measure is just right, because now we’re dealing with an agent of potentially vast optimization power. I’ve also argued that penalizing only the bad side effects seems value-alignment-complete, but it’s possible an approximation would produce reasonable outcomes for less effort than a perfectly value-aware measure requires.
This is one of the reasons it seems qualitatively easier to imagine successfully using an AUP agent – the playing field feels far more level.
Another issue with equally penalizing decreases and increases in power (as AUP does) is that for any event A, it equally penalizes the agent for causing event A and for preventing event A (violating Property 3 in the RR paper). I originally thought that satisfying Property 3 is necessary for avoiding ex post offsetting, which is actually not the case (ex post offsetting is caused by penalizing the given action on future time steps, which the stepwise inaction baseline avoids). However, I still think it’s bad for an impact measure not to distinguish between causation and prevention, especially for irreversible events.
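Here is a toy numeric illustration of that symmetry, with made-up attainable-utility values for a single auxiliary utility (my own sketch, not from either paper):

```python
# Made-up attainable-utility values for one auxiliary utility u, in two scenarios.
# With a stepwise penalty of the form |Q_u(action) - Q_u(noop)|, causing an
# irreversible event and preventing one are penalized identically.

# Scenario 1: the noop leaves a vase intact; the action irreversibly breaks it.
q_noop_1, q_action_1 = 1.0, 0.0
penalty_causing = abs(q_action_1 - q_noop_1)      # = 1.0

# Scenario 2: the noop lets the vase be destroyed; the action saves it.
q_noop_2, q_action_2 = 0.0, 1.0
penalty_preventing = abs(q_action_2 - q_noop_2)   # = 1.0

print(penalty_causing == penalty_preventing)      # True: causation and prevention look the same
```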
This symmetry comes up in the car driving example already mentioned in other comments on this post. The action of keeping the car on the highway is considered “high-impact” because you are penalizing prevention as much as causation. Your suggested solution of using a single action to activate a self-driving car for the whole highway ride is clever, but it has some problems:
This greatly reduces the granularity of the penalty, making credit assignment more difficult.
This effectively uses the initial-branch inaction baseline (branching off when the self-driving car is launched) instead of the stepwise inaction baseline, which means getting clinginess issues back, in the sense of the agent being penalized for human reactions to the self-driving car.
You may not be able to predict in advance when the agent will encounter situations where the default action is irreversible or otherwise undesirable.
In such situations, the penalty will produce bad incentives. Namely, the penalty for staying on the road is proportional to how bad a crash would be, so the tradeoff with goal achievement resolves in an undesirable way. If we keep the reward for the car arriving at its destination constant, then as we increase the badness of a crash (e.g. the number of people on the side of the road who would be run over if the agent took a noop action), eventually the penalty wins the tradeoff with the reward, and the agent chooses the noop. I think it’s very important to avoid this failure mode.
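To make the tradeoff concrete, here is a toy calculation with made-up numbers: the arrival reward is held fixed while the penalty for steering scales with how bad the noop’s crash would be.

```python
# Toy numbers illustrating the tradeoff: the reward for arriving is fixed, while
# the penalty for steering (i.e. preventing the crash the noop would cause)
# grows with how bad that crash would have been.
ARRIVAL_REWARD = 1.0   # reward for the car reaching its destination
PENALTY_WEIGHT = 0.1   # scaling on the impact penalty (e.g. lambda over the impact unit)

for crash_badness in [1, 5, 20, 50]:   # e.g. number of bystanders the noop would run over
    steer_value = ARRIVAL_REWARD - PENALTY_WEIGHT * crash_badness
    noop_value = 0.0                   # the noop earns no reward and incurs no penalty
    choice = "steer" if steer_value > noop_value else "noop"
    print(f"crash badness {crash_badness:2d}: steer value {steer_value:+.1f} -> agent chooses {choice}")
```

With these made-up numbers, once the crash badness exceeds 10 the steering action becomes net-negative and the agent prefers the noop, which is exactly the failure mode described above.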
it equally penalizes the agent for causing event A and for preventing event A
Well, there is some asymmetry due to approval incentives. It isn’t very clear to what extent we can rely on these at the moment (although I think they’re probably quite strong). Also, the agent is more inclined to have certain impacts, as presumably u_A is pointing (very) roughly in the right direction.
this greatly reduces the granularity of the penalty, making credit assignment more difficult.
I don’t think this is too bad here: in effect, driving someone somewhere in a normal way is one kind of action, and normal AUP is too harsh. The question remains whether this is problematic in general. I lean towards no, due to the way the impact unit is calculated, but it deserves further consideration.
This effectively uses the initial-branch inaction baseline (branching off when the self-driving car is launched) instead of the stepwise inaction baseline, which means getting clinginess issues back, in the sense of the agent being penalized for human reactions to the self-driving car.
Intent verification does seem to preclude bad behavior here. As Rohin has pointed out, however, even though every case we can think of seems to be caught by some other part of the method making sure nothing bad happens, the fact that these discrepancies arise at all should indeed give us pause.
You may not be able to predict in advance when the agent will encounter situations where the default action is irreversible or otherwise undesirable.
We might have the agent just sitting in a lab, where the default action seems fine. The failure mode seems easy to avoid in general, although I could be wrong. I also have the intuition that any individual environment we would look at could be configured, by incrementing the impact budget, such that it’s fine.