Great work! I like the extensive set of desiderata and test cases addressed by this method.
The biggest difference from relative reachability, as I see it, is that you penalize increasing the ability to achieve goals, as well as decreasing it. I’m not currently sure whether this is a good idea: while it indeed counteracts instrumental incentives, it could also “cripple” the agent by incentivizing it to settle for more suboptimal solutions than necessary for safety.
For example, the shutdown button in the “survival incentive” gridworld could be interpreted as a supervisor signal (in which case the agent should not disable it) or as an obstacle in the environment (in which case the agent should disable it). Simply penalizing the agent for increasing its ability to achieve goals leads to incorrect behavior in the second case. To behave correctly in both cases, the agent needs more information about the source of the obstacle, which is not provided in this gridworld (the Safe Interruptibility gridworld has the same problem).
Another important difference is that you are using a stepwise inaction baseline (branching off at each time step rather than at the initial time step) and predicting future effects using an environment model. I think this is an improvement on the initial-branch inaction baseline, which avoids clinginess towards independent human actions but not towards human reactions to the agent’s actions. The environment model helps address the stepwise baseline’s failure to penalize delayed effects, though it will only penalize those delayed effects it accurately predicts (e.g. a delayed effect that takes place beyond the model’s planning horizon will not be penalized). I think the stepwise baseline plus environment model could similarly be used in conjunction with relative reachability.
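To make the baseline distinction concrete, here is a minimal toy sketch of the two penalty computations, assuming a deterministic model; `step`, `value`, `noop`, and the horizon are hypothetical stand-ins rather than anything defined in the post.

```python
# Toy sketch (not either author's code): `step(state, action)` advances a deterministic
# model one step, `value(state)` scores a state, and `noop` is a do-nothing action.

def rollout_noop(state, step, noop, horizon):
    """Follow the do-nothing policy for `horizon` steps and return the resulting state."""
    for _ in range(horizon):
        state = step(state, noop)
    return state

def initial_inaction_penalty(initial_state, current_state, t, step, value, noop, horizon):
    """Branch off at the very first time step: compare against never having acted."""
    baseline = rollout_noop(initial_state, step, noop, t + horizon)
    actual = rollout_noop(current_state, step, noop, horizon)
    return abs(value(actual) - value(baseline))

def stepwise_inaction_penalty(state, action, step, value, noop, horizon):
    """Branch off at the current step: compare acting now against doing nothing now."""
    after_action = rollout_noop(step(state, action), step, noop, horizon)
    after_noop = rollout_noop(step(state, noop), step, noop, horizon)
    return abs(value(after_action) - value(after_noop))
```

With the stepwise version, the effects of earlier agent actions (and human reactions to them) are already part of both branches, which is what avoids the clinginess issue; the rollout horizon is what limits which delayed effects get penalized.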
I agree with Charlie that you are giving out checkmarks for the desiderata a bit too easily :). For example, I’m not convinced that your approach is representation-agnostic. It strongly depends on your choice of the set of utility functions and environment model, and those have to be expressed in terms of the state of the world. (Note that the utility functions in your examples, such as u_closet and u_left, are defined in terms of reaching a specific state.) I don’t think your method can really get away from making a choice of state representation.
Your approach might have the same problem as other value-agnostic approaches (including relative reachability) with mostly penalizing irrelevant impacts. The AUP measure seems likely to give most of its weight to utility functions that are irrelevant to humans, while the RR measure could give most of its weight to preserving reachability of irrelevant states. I don’t currently know a way around this that’s not value-laden.
Meta point: I think it would be valuable to have a more concise version of this post that introduces the key insight earlier on, since I found it a bit verbose and difficult to follow. The current writeup seems to be structured according to the order in which you generated the ideas, rather than an order that would be more intuitive to readers. FWIW, I had the same difficulty when writing up the relative reachability paper, so I think it’s generally challenging to clearly present ideas about this problem.
The biggest difference from relative reachability, as I see it, is that you penalize increasing the ability to achieve goals, as well as decreasing it.
I strongly disagree that this is the largest difference, and I suspect you might be modeling AUP as some kind of RR variant.
Consider RR in the real world, as I imagine it (I could be mistaken about the details of some of these steps, but I expect my overall point holds). We receive observations, which, in combination with some predetermined ontology and an observation history → world state function, we use to assign a distribution over possible physical worlds. We also need another model, since we need to know what we can do and reach from a specific world configuration. Then, we calculate another distribution over world states that we’d expect to be in if we did nothing. We also need a distance metric weighting the importance of different discrepancies between states. We have to calculate the coverage reduction of each action-state (or use representative examples, which also seems hard), with respect to each start-state, weighted using our initial and post-action distributions. We also need to figure out which states we care about and which we don’t, so that’s another weighting scheme. But what about ontological shift?
This approach is fundamentally different. We cut out the middleman, considering impact to be a function of our ability to string together favorable action-observation histories, requiring only a normal model. The “state importance”/locality problem disappears. Ontological problems disappear. Some computational constraints (imposed by coverage) disappear. The “state difference weighting” problem disappears. Two concepts of impact are unified.
I’m not saying RR isn’t important—just that it’s quite fundamentally different, and that AUP cuts away a swath of knotty problems because of it.
Edit: I now understand that you were referring to the biggest conceptual difference in the desiderata fulfilled. While that isn’t necessarily how I see it, I don’t disagree with that way of viewing things.
To behave correctly in both cases, the agent needs more information about the source of the obstacle, which is not provided in this gridworld (the Safe Interruptibility gridworld has the same problem).
If the agent isn’t overcoming obstacles, we can just increase N. Otherwise, there’s a complicated distinction between the cases, and I don’t think we should make problems for ourselves by requiring this. I think eliminating this survival incentive is extremely important for this kind of agent, and arguably leads to behaviors that are drastically easier to handle.
(Note that the utility functions in your examples, such as u_closet and u_left, are defined in terms of reaching a specific state.)
Technically, for receiving observations produced by a state. This was just for clarity.
I don’t think your method can really get away from making a choice of state representation.
And why is this, given that the inputs are histories? Why can’t we simply measure power?
The AUP measure seems likely to give most of its weight to utility functions that are irrelevant to humans, while the RR measure could give most of its weight to preserving reachability of irrelevant states.
I discussed in “Utility Selection” and “AUP Unbound” why I think this actually isn’t the case, surprisingly. What are your disagreements with my arguments there?
I think it would be valuable to have a more concise version of this post that introduces the key insight earlier on, since I found it a bit verbose
Oops, noted. I had a distinct feeling of “if I’m going to make claims this strong in a venue this critical about a topic this important, I better provide strong support”.
Edit:
difficult to follow
I think there might be an inferential gap I failed to bridge for you here. In particular, thinking about the world-state as a thing seems actively detrimental when learning about AUP, in my experience. I barely mention it for exactly this reason.
If the agent isn’t overcoming obstacles, we can just increase N.
Wouldn’t increasing N potentially increase the shutdown incentive, given the tradeoff between shutdown incentive and overcoming obstacles?
I think eliminating this survival incentive is extremely important for this kind of agent, and arguably leads to behaviors that are drastically easier to handle.
I think we have a disagreement here about which desiderata are more important. Currently I think it’s more important for the impact measure not to cripple the agent’s capability, and the shutdown incentive might be easier to counteract using some more specialized interruptibility technique rather than an impact measure. Not certain about this though—I think we might need more experiments on more complex environments to get some idea of how bad this tradeoff is in practice.
And why is this, given that the inputs are histories? Why can’t we simply measure power?
Your measurement of “power” (I assume you mean Q_u?) needs to be grounded in the real world in some way. The observations will be raw pixels or something similar, while the utilities and the environment model will be computed in terms of some sort of higher-level features or representations. I would expect the way these higher-level features are chosen or learned to affect the outcome of that computation.
I discussed in “Utility Selection” and “AUP Unbound” why I think this actually isn’t the case, surprisingly. What are your disagreements with my arguments there?
I found those sections vague and unclear (after rereading a few times), and didn’t understand why you claim that a random set of utility functions would work. E.g. what do you mean by “long arms of opportunity cost and instrumental convergence”? What does the last paragraph of “AUP Unbound” mean and how does it imply the claim?
Oops, noted. I had a distinct feeling of “if I’m going to make claims this strong in a venue this critical about a topic this important, I better provide strong support”.
Providing strong support is certainly important, but I think it’s more about clarity and precision than quantity. Better to give one clear supporting statement than many unclear ones :).
it’s more important for the impact measure not to cripple the agent’s capability, and the shutdown incentive might be easier to counteract using some more specialized interruptibility technique rather than an impact measure.
So I posit that there actually is not a tradeoff to any meaningful extent. First note that there are two kinds of environments here: one which platonically just is a gridworld with a “shutdown” component, and one in which we simulate such a world. I’m going to discuss the latter, although I expect that similar arguments apply – at least for the first paragraph.
Suppose that the agent is fairly intelligent, but has not yet realized that it is being simulated. So we define the impact unit and budget, and see that the agent unfortunately does not overcome the obstacle. We increase the budget until it does.
Suppose that it has the realization, and refactors its model somehow. It now realizes that what it should be doing is stringing together favorable observations, within the confines of its impact budget. However, the impact unit is still calculated with respect to some fake movement in the fake world, so the penalty for actually avoiding shutdown is massive.
Now, what if there is a task in the real world we wish it to complete which seemingly requires taking on a risk of being shut down? For example, we might want it to drive us somewhere. The risk of a crash is non-trivial with respect to the penalty. However, note that the agent could just construct a self-driving car for us and activate it with one action. This is seemingly allowed by intent verification.
So it seems to me that this task, and other potential counterexamples, all admit some way of completing the desired objective in a low-impact way – even if it’s a bit more indirect than what we would immediately imagine. By not requiring the agent to actually physically be doing things, we seem to be able to get the best of both worlds.
I found those sections vague and unclear (after rereading a few times), and didn’t understand why you claim that a random set of utility functions would work. E.g. what do you mean by “long arms of opportunity cost and instrumental convergence”? What does the last paragraph of “AUP Unbound” mean and how does it imply the claim?
Simply the ideas alluded to by Theorem 1 and seemingly commonly accepted within alignment discussion: using up (or gaining) resources changes your ability to achieve arbitrary goals. Likewise for self-improvement. Even though the specific goals aren’t necessarily related to ours, the way in which their attainable values change is (I conjecture) related to how ours change.
The last paragraph is getting at the idea that almost every attainable utility is really just tracking the agent’s ability to wirehead that utility from its vantage point after executing a plan. It’s basically making the case that even though there are a lot of weird functions, the changes in their attainable values should still capture what we want. This is more of a justification for why the unbounded case works, and less about random utilities.
Actually, I think it was incorrect of me to frame this issue as a tradeoff between avoiding the survival incentive and not crippling the agent’s capability. What I was trying to point at is that the way you are counteracting the survival incentive is by penalizing the agent for increasing its power, and that interferes with the agent’s capability. I think there may be other ways to counteract the survival incentive without crippling the agent, and we should look for those first before agreeing to pay such a high price for interruptibility. I generally believe that ‘low impact’ is not the right thing to aim for, because ultimately the goal of building AGI is to have high impact—high beneficial impact. This is why I focus on the opportunity-cost-incurring aspect of the problem, i.e. avoiding side effects.
Note that AUP could easily be converted to a side-effects-only measure by replacing the |difference| with a max(0, difference). Similarly, RR could be converted to a measure that penalizes increases in power by doing the opposite (replacing max(0, difference) with |difference|). (I would expect that variant of RR to counteract the survival incentive, though I haven’t tested it yet.) Thus, it may not be necessary to resolve the disagreement about whether it’s good to penalize increases in power, since the same methods can be adapted to both cases.
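As a minimal sketch of what that swap looks like, assuming lists of attainable values for each auxiliary utility under the action and under the no-op baseline (hypothetical names, not code from either paper):

```python
def aup_penalty(q_action, q_noop):
    """AUP as proposed: penalize any change in attainable utility, whether up or down."""
    return sum(abs(qa - qn) for qa, qn in zip(q_action, q_noop))

def decrease_only_penalty(q_action, q_noop):
    """Side-effects-only variant: penalize only decreases, analogous to RR's truncation."""
    return sum(max(0.0, qn - qa) for qa, qn in zip(q_action, q_noop))
```

Swapping the absolute difference for the truncated difference (or vice versa for RR) is the whole change, which is why the two families of methods can be adapted to either convention.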
I think there may be other ways to counteract the survival incentive without crippling the agent, and we should look for those first before agreeing to pay such a high price for interruptibility. I generally believe that ‘low impact’ is not the right thing to aim for, because ultimately the goal of building AGI is to have high impact—high beneficial impact. This is why I focus on the opportunity-cost-incurring aspect of the problem, i.e. avoiding side effects.
Oh. So, when I see that this agent won’t really go too far to improve itself, I’m really happy. My secret intended use case as of right now is to create safe technical oracles which, with the right setup, help us solve specific alignment problems and create a robust AGI. (Don’t worry about the details for now.)
The reason I don’t think low impact will work in the long run for ensuring good outcomes on its own is that even if we have a perfect measure, at some point someone will push the impact dial too far. It doesn’t seem like a stable equilibrium.
Similarly, if you don’t penalize instrumental convergence, it seems like we have to really make sure that the impact measure is just right, because now we’re dealing with an agent of potentially vast optimization power. I’ve also argued that penalizing only the bad side effects seems value-alignment-complete, but it’s possible an approximation would produce reasonable outcomes for less effort than a perfectly value-aware measure requires.
This is one of the reasons it seems qualitatively easier to imagine successfully using an AUP agent – the playing field feels far more level.
Another issue with equally penalizing decreases and increases in power (as AUP does) is that for any event A, it equally penalizes the agent for causing event A and for preventing event A (violating Property 3 in the RR paper). I originally thought that satisfying Property 3 is necessary for avoiding ex post offsetting, which is actually not the case (ex post offsetting is caused by penalizing the given action on future time steps, which the stepwise inaction baseline avoids). However, I still think it’s bad for an impact measure not to distinguish between causation and prevention, especially for irreversible events.
This comes up in the car driving example already mentioned in other comments on this post. The reason the action of keeping the car on the highway is considered “high-impact” is because you are penalizing prevention as much as causation. Your suggested solution of using a single action to activate a self-driving car for the whole highway ride is clever, but has some problems:
This greatly reduces the granularity of the penalty, making credit assignment more difficult.
This effectively uses the initial-branch inaction baseline (branching off when the self-driving car is launched) instead of the stepwise inaction baseline, which means getting clinginess issues back, in the sense of the agent being penalized for human reactions to the self-driving car.
You may not be able to predict in advance when the agent will encounter situations where the default action is irreversible or otherwise undesirable.
In such situations, the penalty will produce bad incentives. Namely, the penalty for staying on the road is proportional to how bad a crash would be, so the tradeoff with goal achievement resolves in an undesirable way. If we keep the reward for the car arriving at its destination constant, then as we increase the badness of a crash (e.g. the number of people on the side of the road who would be run over if the agent took a noop action), eventually the penalty wins in the tradeoff with the reward, and the agent chooses the noop. I think it’s very important to avoid this failure mode.
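In rough symbols (my notation, with B the badness of the crash that staying on the road prevents and R the fixed arrival reward):

```latex
\text{Penalty}(\text{stay on road}) \;\propto\; B,
\qquad
\text{stay on road only while}\quad R \;>\; \text{Penalty}(\text{stay on road}),
```

so for a large enough B the noop, and with it the crash, wins the tradeoff.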
it equally penalizes the agent for causing event A and for preventing event A
Well, there is some asymmetry due to approval incentives. It isn’t very clear to what extent we can rely on these at the moment (although I think they’re probably quite strong). Also, the agent is more inclined to have certain impacts, as presumably u_A is pointing (very) roughly in the right direction.
this greatly reduces the granularity of the penalty, making credit assignment more difficult.
I don’t think this seems too bad here: in effect, driving someone somewhere in a normal way is one kind of action, and normal AUP is too harsh. The question remains whether this is problematic in general. I lean towards no, due to the way the impact unit is calculated, but it deserves further consideration.
This effectively uses the initial-branch inaction baseline (branching off when the self-driving car is launched) instead of the stepwise inaction baseline, which means getting clinginess issues back, in the sense of the agent being penalized for human reactions to the self-driving car.
Intent verification does seem to preclude bad behavior here. As Rohin has pointed out, however, even if every case we can think of seems to be caught by some other part that is making sure nothing bad happens, the fact that these discrepancies arise at all should give us pause.
You may not be able to predict in advance when the agent will encounter situations where the default action is irreversible or otherwise undesirable.
We might have the agent just sitting in a lab, where the default action seems fine. The failure mode seems easy to avoid in general, although I could be wrong. I also have the intuition that any individual environment we look at can be configured, by incrementing N, so that it’s fine.
Wouldn’t increasing N potentially increase the shutdown incentive, given the tradeoff between shutdown incentive and overcoming obstacles?
Huh? No, N is in the denominator of the penalty term.
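Concretely, the scaled penalty has roughly this shape (notation reconstructed here, so treat the details as approximate):

```latex
\text{Penalty}(h, a) \;=\; \sum_{i} \bigl|\, Q_{u_i}(h\, a) \;-\; Q_{u_i}(h\, \varnothing) \,\bigr|,
\qquad
u_{\text{AUP}}(h, a) \;=\; u_A(h\, a) \;-\; \frac{\text{Penalty}(h, a)}{N \cdot \text{ImpactUnit}},
```

so increasing N shrinks the scaled penalty and loosens the budget rather than tightening it.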
Your measurement of “power” (I assume you mean Q_u?) needs to be grounded in the real world in some way. The observations will be raw pixels or something similar, while the utilities and the environment model will be computed in terms of some sort of higher-level features or representations.
No, the utility functions are literally just over actions and observations. It’s true that among all computable utilities, some of the more complex ones will be doing something that we would deem to be grading a model of the actual world. This kind of thing is not necessary for the method to work.
Suppose that you receive 1 utility if you’re able to remain activated during the entire epoch. Then we see that Q_{u_1} becomes the probability of the agent ensuring it remains activated the whole time (this new “alien” agent does not have the impact measure restriction). As the agent gains optimization power and/or resources, this increases. This has nothing to do with anything actually going on in the world, beyond what is naturally inferred from its model of what observations it will see in the future given what it has seen so far.
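In symbols, under that binary utility (notation reconstructed here rather than quoted from the post):

```latex
Q_{u_1}(h_{1:t}) \;=\; \max_{\pi}\, \mathbb{E}\bigl[\, u_1 \mid h_{1:t}, \pi \,\bigr]
\;=\; \max_{\pi}\, \Pr\bigl(\text{remain activated through the epoch} \mid h_{1:t}, \pi\bigr),
```

where the maximization is over the unconstrained “alien” agent’s policies.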