I like the proposed iterative formulation for the step-wise inaction counterfactual, though I would replace π_Human with π_Environment to account for environment processes that are not humans but can still “react” to the agent’s actions. The step-wise counterfactual also improves over the naive inaction counterfactual by avoiding repeated penalties for the same action, which could help prevent offsetting behaviors under a penalty that includes reversible effects.
However, as you point out, not penalizing the agent for human reactions to its actions introduces a manipulation incentive for the agent to channel its effects through humans, which seems potentially very bad. The tradeoff you identified is quite interesting, though I’m not sure whether penalizing the agent for human reactions necessarily leads to an incentive to put humans in stasis, since that is also quite a large effect (such a penalty could instead incentivize the agent to avoid undue influence on humans, which seems good). I think there might be a different tradeoff (for a penalty that incorporates reversible effects): between avoiding offsetting behaviors (where the step-wise counterfactual likely succeeds and the naive inaction counterfactual can fail) and avoiding manipulation incentives (where the step-wise counterfactual fails and the naive inaction counterfactual succeeds). I wonder if some sort of combination of these two counterfactuals could get around the tradeoff.
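To make the repeated-penalty difference concrete, here is a minimal toy sketch (the environment, action names, and penalty functions are all my own hypothetical constructions, not anything from the original post) in which the agent irreversibly breaks a vase at the first step:

```python
# Toy sketch (hypothetical): state is the number of broken vases;
# "noop" leaves the state unchanged, "break" breaks one vase.

def step(state, action):
    return state + 1 if action == "break" else state

def naive_inaction_penalty(actions):
    """Penalize divergence from the rollout where the agent never acts."""
    state, inaction_state, total = 0, 0, 0
    for a in actions:
        state = step(state, a)
        inaction_state = step(inaction_state, "noop")
        total += abs(state - inaction_state)   # the same effect is re-penalized
    return total

def stepwise_inaction_penalty(actions):
    """Penalize divergence from doing nothing *at this step only*."""
    state, total = 0, 0
    for a in actions:
        counterfactual = step(state, "noop")   # branch from the current state
        state = step(state, a)
        total += abs(state - counterfactual)   # each effect counted once
    return total

actions = ["break", "noop", "noop", "noop"]
print(naive_inaction_penalty(actions))     # 4: the broken vase is re-counted each step
print(stepwise_inaction_penalty(actions))  # 1: penalized only when it happens
```

The repeated charge under the naive counterfactual is what creates pressure to offset: undoing the effect stops the ongoing penalty stream, whereas the step-wise version charges once and leaves no such pressure.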
I’ve thought some more about the step-wise inaction counterfactual, and I think there are more issues with it beyond the human manipulation incentive. With the step-wise counterfactual, future transitions that are caused by the agent’s current actions will not be penalized, since by the time those transitions happen, they are included in the counterfactual. Thus, there is no penalty for a current transition that sets in motion effects that don’t happen immediately (this includes influencing humans), unless the whitelisting process takes into account that this transition causes those effects (e.g. using a causal model).
For example, if the agent puts a vase on a conveyor belt (which results in the vase breaking a few time steps later), it would only be penalized if the “vase near belt → vase on belt” transition is not in the whitelist, i.e. if the whitelisting process takes into account that the belt would eventually break the vase. There are also situations where penalizing the “vase near belt → vase on belt” transition would not make sense, e.g. if the agent works in a vase-making factory and the conveyor belt takes the vase to the next step in the manufacturing process. Thus, for this penalty to reliably work, the whitelisting process needs to take into account accurate task-specific causal information, which I think is a big ask. The agent would also not be penalized for butterfly effects that are difficult to model, so it would have an incentive to channel its impact through butterfly effects of whitelisted transitions.
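A toy sketch of the conveyor-belt example (all dynamics and names here are hypothetical illustrations): only the placement step diverges from its single-step noop branch; by the time the vase breaks, the breakage is in the counterfactual too, so it goes unpenalized unless the placement transition itself is off-whitelist:

```python
# Hypothetical conveyor-belt dynamics: a vase placed on the belt advances
# one position per step and breaks when it reaches the end, with no further
# agent action required. State is (belt_position_or_None, broken).

def env_step(state):
    """Environment dynamics: the belt carries the vase toward the end."""
    pos, broken = state
    if pos is not None and not broken:
        pos += 1
        if pos >= 2:          # end of the belt
            broken = True
    return (pos, broken)

def agent_step(state, action):
    pos, broken = state
    if action == "put_on_belt":
        pos = 0               # vase placed at the start of the belt
    return env_step((pos, broken))

def stepwise_penalty(state, action):
    """1 if the outcome differs from the single-step noop branch, else 0."""
    actual = agent_step(state, action)
    counterfactual = agent_step(state, "noop")
    return int(actual != counterfactual)

state, total = (None, False), 0
for action in ["put_on_belt", "noop", "noop"]:
    total += stepwise_penalty(state, action)
    state = agent_step(state, action)

print(state)  # the vase ends up broken
print(total)  # 1: only the placement diverged from its noop branch; the
              # breakage matches the counterfactual, so it is never charged
```

So the entire penalty decision reduces to how the single “vase near belt → vase on belt” transition is treated, which is exactly why the whitelisting process needs the causal knowledge described above.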
This issue is indeed present in my post as written. I realized after the deadline that I hadn’t spelled this out at all, and I didn’t feel comfortable editing at that point; there’s a little clarification in the post now.
For each time step t = 1, …, T, we’re running both of those effects() calls indefinitely. For each time step in the simulation, we penalize those effects which are only in the π^{1:t}_M simulation at that (simulated) time step and which manifest under the full plan. This means that if M directly caused a side effect, it gets counted exactly once.
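One way to read this counting argument as code (a sketch under my own assumptions about what effects() computes — the helper names and toy environment are hypothetical, not the post’s exact formalism): at each step t, compare the effects of running the plan through step t and then idling against running it only through step t−1, and charge only for new effects that also manifest under the full plan. Each effect then has exactly one step it is attributed to.

```python
# Sketch: attribute each effect to the single step whose action introduced it.

def effects(env_step, state, prefix, horizon):
    """Simulate `prefix`, then idle out to `horizon`; collect observed effects."""
    seen = set()
    for i in range(horizon):
        action = prefix[i] if i < len(prefix) else "noop"
        state, new = env_step(state, action)
        seen |= new
    return seen

def penalty(env_step, init, plan, horizon):
    full = effects(env_step, init, plan, horizon)
    total = 0
    for t in range(1, len(plan) + 1):
        with_t = effects(env_step, init, plan[:t], horizon)
        without_t = effects(env_step, init, plan[:t - 1], horizon)
        total += len((with_t - without_t) & full)  # new effects, counted once
    return total

# Toy environment: "smash" irreversibly produces the effect "vase_broken".
def env_step(state, action):
    return state, ({"vase_broken"} if action == "smash" else set())

print(penalty(env_step, None, ["smash", "noop", "noop"], horizon=5))  # 1
```

Extending the simulation horizon doesn’t re-count the broken vase: it appears in both the step-1 and step-2 branches from step 2 onward, so the set difference charges it only at the step that caused it.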
I agree that modeling butterfly effects like that is a big ask, but the idea was to get an unbounded solution and see where that left us.
So I don’t know how we could quantify “stopping humans from having effects” as an effect without a strong offsetting incentive.
Let’s consider an alternate form of whitelisting, where we instead know the specific object-level transitions per time step that would have occurred in the naive counterfactual (where the agent does nothing). Discarding the whitelist, we instead penalize distance from the counterfactual latent-space transitions at that time step.
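A minimal sketch of this proposal (the latent embeddings are assumed given; the function name and tuple representation are hypothetical): each observed transition is embedded, and the penalty is its distance from the transition the naive do-nothing rollout would have produced at the same time step.

```python
# Sketch: penalize per-step distance from the counterfactual transition
# embeddings, summed over the episode. Transitions are represented here as
# plain coordinate tuples standing in for latent-space vectors.

from math import dist  # Euclidean distance between two points (Python 3.8+)

def distance_penalty(actual, counterfactual):
    """Sum of per-step Euclidean distances between transition embeddings."""
    return sum(dist(a, c) for a, c in zip(actual, counterfactual))

same = [(0.0, 0.0, 0.0, 0.0)] * 2
diverged = [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0)]
print(distance_penalty(same, same))      # 0.0: matching the do-nothing rollout is free
print(distance_penalty(same, diverged))  # 2.0: any divergence is charged
```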
This basically locks us into a particular world-history. While this might be manipulation- and stasis-free, this is a different kind of clinginess. You’re basically saying “optimize this utility the best you can without letting there be an actual impact”. However, I actually hadn’t thought of this formulation before, and it’s plausible it’s even more desirable than whitelisting, as it seems to get us a low/no-impact agent semi-robustly. The trick is then allowing favorable effects to take place without getting back to stasis/manipulation.
There’s another problem, however: “people conclude that this AI design doesn’t work and try another variant” is a pretty plausible result of this naive counterfactual. When people imagine the counterfactual, it seems they usually think about “what would happen if the agent did nothing and then people shrugged and went about their lives, forgetting about AGI”. The odds of that being the counterfactual are pretty slim. It’s even possible that any agents/variants people would make in the counterfactual would have undefined behavior… Sufficiently-similar agents would also simulate what would happen if they did nothing, got tweaked and rebooted, and then ran the same simulation… where would it bottom out, and with what conclusion? Probably with a wholly-different kind of agent being tried out.
The iterative formulation doesn’t seem to have that failure mode.
“Let’s consider an alternate form of whitelisting, where we instead know the specific object-level transitions per time step that would have occurred in the naive counterfactual (where the agent does nothing). Discarding the whitelist, we instead penalize distance from the counterfactual latent-space transitions at that time step.”
How would you define a distance measure on transitions? Since this would be a continuous measure of how good transitions are, rather than a discrete list of good transitions, in what sense is it a form of whitelisting?
“This basically locks us into a particular world-history. While this might be manipulation- and stasis-free, this is a different kind of clinginess. You’re basically saying “optimize this utility the best you can without letting there be an actual impact”. However, I actually hadn’t thought of this formulation before, and it’s plausible it’s even more desirable than whitelisting, as it seems to get us a low/no-impact agent semi-robustly. The trick is then allowing favorable effects to take place without getting back to stasis/manipulation.”
I expect that in complex tasks where we don’t know the exact actions we would like the agent to take, this would prevent the agent from being useful or coming up with new unforeseen solutions. I have this concern about whitelisting in general, though giving the agent the ability to query the human about non-whitelisted effects is an improvement. The distance measure on transitions could also be traded off with reward (or some other task-specific objective function), so if an action is sufficiently useful for the task, the high reward would dominate the distance penalty.
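A sketch of that trade-off (the scoring rule, λ coefficient, and function name are my own assumptions, not from the original post): the distance penalty is subtracted from task reward with a scaling coefficient, so a sufficiently useful action survives its penalty.

```python
# Hypothetical scoring rule: rank plans by reward minus the scaled impact penalty.

def plan_score(reward, distance_penalty, lam=1.0):
    """Higher is better; lam controls how strongly impact is discouraged."""
    return reward - lam * distance_penalty

# A high-reward action dominates its penalty...
print(plan_score(reward=10.0, distance_penalty=2.0))  # 8.0
# ...while a marginal one loses out to low-impact alternatives.
print(plan_score(reward=1.0, distance_penalty=2.0))   # -1.0
```

How well this works of course depends entirely on choosing λ so that useful novel solutions outweigh their penalties without licensing large impacts.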
This would still have offsetting issues though. In the asteroid example, if the agent deflects the asteroid, then future transitions (involving human actions) are very different from default transitions (involving no human actions), so the agent would have an offsetting incentive.
You’re right, it isn’t. I should have been more precise:
“Suppose we have an impact measure that considers whitelist-esque object transitions, but doesn’t use a whitelist. Instead, it penalizes how dissimilar the observed object transitions are at a time step to those which were counterfactually expected.”
“I expect that in complex tasks where we don’t know the exact actions we would like the agent to take, this would prevent the agent from being useful or coming up with new unforeseen solutions. I have this concern about whitelisting in general, though giving the agent the ability to query the human about non-whitelisted effects is an improvement.”
I think this failure mode on its own is relatively benign, given querying.
What I find more worrying is that an intelligent agent would likely be able to hard-optimize while avoiding penalties (through the allowed transitions, by skating by on technicalities re: object recognition, etc.).
I suspect the/a ideal solution will have far fewer parameters (if any).