I’ve thought some more about the step-wise inaction counterfactual, and I think there are more issues with it beyond the human manipulation incentive. With the step-wise counterfactual, future transitions that are caused by the agent’s current actions will not be penalized, since by the time those transitions happen, they are included in the counterfactual. Thus, there is no penalty for a current transition that set in motion some effects that don’t happen immediately (this includes influencing humans), unless the whitelisting process takes into account that this transition causes these effects (e.g. using a causal model).
For example, if the agent puts a vase on a conveyor belt (which results in the vase breaking a few time steps later), it would only be penalized if the “vase near belt → vase on belt” transition is not in the whitelist, i.e. if the whitelisting process takes into account that the belt would eventually break the vase. There are also situations where penalizing the “vase near belt → vase on belt” transition would not make sense, e.g. if the agent works in a vase-making factory and the conveyor belt takes the vase to the next step in the manufacturing process. Thus, for this penalty to reliably work, the whitelisting process needs to take into account accurate task-specific causal information, which I think is a big ask. The agent would also not be penalized for butterfly effects that are difficult to model, so it would have an incentive to channel its impact through butterfly effects of whitelisted transitions.
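The delayed-breakage scenario above can be sketched in a toy simulation. This is my own minimal model, not anything from the original post: a vase placed on a conveyor belt breaks three steps later, and the step-wise penalty compares each transition against one step of inaction from the current state. Because the baseline is re-anchored every step, the eventual breakage also occurs in the baseline branch and goes unpenalized; only the initial placement registers.

```python
# Toy model (hypothetical): a vase put on a conveyor belt breaks 3 steps
# later. The step-wise inaction baseline compares each transition against
# "do nothing from the current state", so once the vase is on the belt,
# the breakage happens in the baseline too and is never penalized.

def step(state, action):
    """Advance the minimal world model one step."""
    pos, timer = state
    if action == "put_on_belt" and pos == "near_belt":
        return ("on_belt", 3)          # will break after 3 more steps
    if pos == "on_belt":
        return ("on_belt", timer - 1) if timer > 1 else ("broken", 0)
    return state                        # inaction leaves things unchanged

def stepwise_penalty(state, action):
    """Penalize only transitions that differ from one step of inaction."""
    actual = step(state, action)
    baseline = step(state, "noop")
    return 1 if actual != baseline else 0

state, total = ("near_belt", 0), 0
for a in ["put_on_belt", "noop", "noop", "noop"]:
    total += stepwise_penalty(state, a)
    state = step(state, a)

print(state)   # ('broken', 0) -- the vase ends up broken
print(total)   # 1 -- only the initial placement was ever penalized
```

Unless the whitelisting process already knows that "vase near belt → vase on belt" leads to breakage, that single penalized transition is the only signal available, which is the point made above.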
So this issue does apply to my post as written. I realized after the deadline that I hadn’t spelled this out at all, and I didn’t feel comfortable editing at that point; there’s now a brief clarification in the post.
For each time step t=1,…,T, we run both of those effects() calls out indefinitely. For each time step in the simulation, we penalize those effects which appear only in the π^M_{:t} simulation at that (simulated) time step and which also manifest under the full plan. This means that if M directly caused a side effect, it gets counted exactly once.
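One way to read this counting rule (my own toy formalization, not the original construction) is: an effect is attributed to the first step t whose plan prefix π^M_{:t} introduces it, provided it also manifests under the full plan. Since every later prefix contains that effect as well, the set difference between consecutive prefixes attributes each directly caused effect to exactly one t. The `effects` function and the specific side effects below are illustrative assumptions.

```python
# Toy reading of the counting rule: effects(t) simulates following the
# plan through step t and then idling forever, returning the set of side
# effects that eventually occur. Each effect is charged at the first
# prefix that introduces it, and only if it also manifests under the full
# plan -- so each directly caused effect is counted exactly once.

def penalty(effects, plan_length):
    full = effects(plan_length)                  # effects under the full plan
    total = 0
    for t in range(1, plan_length + 1):
        new_at_t = effects(t) - effects(t - 1)   # effects only in the prefix-t simulation
        total += len(new_at_t & full)            # ...that also manifest under the full plan
    return total

# Hypothetical plan: step 1 puts the vase on the belt (it breaks later),
# step 2 knocks over a chair. Delayed effects are still charged to step 1.
def effects(t):
    out = set()
    if t >= 1:
        out.add("vase broken")           # delayed consequence of step 1
    if t >= 2:
        out.add("chair knocked over")
    return out

print(penalty(effects, 2))   # 2 -- each side effect counted exactly once
```

Note how the delayed vase breakage is charged to step 1 even though it only manifests later, which is what distinguishes this scheme from the step-wise baseline discussed above.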
I agree that it’s a big ask, modeling butterfly effects like that, but the idea was to get an unbounded solution and see where that left us.