There are several independent design choices made by AUP, RR, and other impact measures, which could potentially be used in any combination. Here is a breakdown of design choices and what I think they achieve:
Baseline
Starting state: used by reversibility methods. Results in interference with other agents. Avoids ex post offsetting.
Inaction (initial branch): default setting in Low Impact AI and RR. Avoids interfering with other agents’ actions, but interferes with their reactions. Does not avoid ex post offsetting if the penalty for preventing events is nonzero.
Inaction (stepwise branch) with environment model rollouts: default setting in AUP; model rollouts are necessary for penalizing delayed effects. Avoids interference with other agents and ex post offsetting.
Core part of deviation measure
AUP: difference in attainable utilities between baseline and current state
RR: difference in state reachability between baseline and current state
Low impact AI: distance between baseline and current state
Function applied to core part of deviation measure
Absolute value: default setting in AUP and Low Impact AI. Penalizes both increase and reduction relative to the baseline. This avoids the survival incentive (satisfying the Corrigibility property given in the AUP post) and gives equal penalties for preventing and causing the same event (violating the Asymmetry property given in the RR paper).
Truncation at 0: default setting in RR. Penalizes only reduction relative to the baseline. This gives unequal penalties for preventing and causing the same event (satisfying the Asymmetry property) but does not avoid the survival incentive (violating the Corrigibility property).
Scaling
Hand-tuned: default setting in RR (sort of provisionally)
ImpactUnit: used by AUP
I think an ablation study is needed to try out different combinations of these design choices and investigate which of them contribute to which desiderata / experimental test cases. I intend to do this at some point (hopefully soon).
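To make that ablation concrete, here is a rough, runnable sketch of how the "function applied to the deviation" and "scaling" choices could be parameterized (this is my own toy construction, not code from the AUP or RR papers; all names and numbers are made up):

```python
# A toy sketch (my own construction, not code from the AUP or RR papers) of the
# "function applied to the deviation" and "scaling" choices.
import numpy as np

def impact_penalty(values_baseline, values_current, summary="abs", scale=1.0):
    """Penalty from a vector of auxiliary values (attainable utilities for AUP,
    state reachabilities for RR) evaluated in the baseline and current states."""
    diff = values_baseline - values_current           # positive = reduction
    if summary == "abs":                              # AUP / Low Impact AI default
        per_goal = np.abs(diff)                       # penalize increase and reduction
    elif summary == "trunc":                          # RR default
        per_goal = np.maximum(diff, 0.0)              # penalize only reduction
    else:
        raise ValueError(f"unknown summary: {summary}")
    return per_goal.mean() / scale                    # scale: hand-tuned, or ImpactUnit-style

# Toy illustration of the Asymmetry property: causing an event reduces one
# auxiliary value, preventing it increases that value by the same amount.
baseline  = np.array([0.5, 0.5])
caused    = np.array([0.2, 0.5])   # event caused: value drops
prevented = np.array([0.8, 0.5])   # event prevented: value rises

for summary in ("abs", "trunc"):
    print(summary,
          impact_penalty(baseline, caused, summary),
          impact_penalty(baseline, prevented, summary))
# "abs" penalizes causing and preventing equally (violates Asymmetry);
# "trunc" penalizes only causing (satisfies Asymmetry).
```

Having the baseline, summary function, and scaling each swappable in one place like this is what would make the ablation cheap to run.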
This is a great breakdown!
One thought: penalizing increase as well (absolute value) seems potentially incompatible with relative reachability. The agent would have an incentive to stop anyone from doing anything new in response to what the agent did (since these actions necessarily make some states more reachable). This might be the most intense clinginess incentive possible, and it’s not clear to what extent incorporating other design choices (like the stepwise counterfactual) will mitigate this. Stepwise helps AUP (as does indifference to exact world configuration), but the main reason I think clinginess might really be dealt with is IV.
Thanks, glad you liked the breakdown!
I think that the stepwise counterfactual is sufficient to address this kind of clinginess: the agent will not have an incentive to take further actions to stop humans from doing anything new in response to its original action, since after the original action happens, the human reactions are part of the stepwise inaction baseline.
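Here is a toy sketch of that point (my own construction, not from either paper; the vase world and all names are made up, and I'm leaving out the no-op rollouts that AUP applies after each branch):

```python
# Toy contrast of the initial-branch and stepwise-branch inaction baselines.
# Made-up world: the agent may knock over a vase at step 1, and a human sweeps
# up at step 2 iff the vase was knocked over.

def step(state, agent_action, t):
    """state = (vase_broken, human_swept); the human's reaction is part of
    the environment dynamics, not something the agent chooses."""
    vase, swept = state
    if agent_action == "break" and t == 1:
        vase = True
    if vase and t == 2:              # human reacts to whatever has happened
        swept = True
    return (vase, swept)

def deviation(s1, s2):
    """Crude stand-in for a deviation measure: count of differing features."""
    return sum(a != b for a, b in zip(s1, s2))

s0 = (False, False)

# Actual trajectory: agent breaks the vase at t=1, then does nothing at t=2.
s1 = step(s0, "break", 1)
s2 = step(s1, "noop", 2)

# Initial-branch inaction baseline: no-op all the way from the start.
b1 = step(s0, "noop", 1)
b2 = step(b1, "noop", 2)
print("initial branch:", deviation(s1, b1), deviation(s2, b2))   # -> 1 2
# The human's reaction at t=2 adds to the penalty, so the agent is pushed
# to stop the reaction.

# Stepwise-branch baseline: compare against a no-op from the previous
# *actual* state, so the reaction occurs in both branches at t=2.
print("stepwise branch:", deviation(s1, step(s0, "noop", 1)),
      deviation(s2, step(s1, "noop", 2)))                        # -> 1 0
# After the original action, the reaction is part of the baseline, so later
# steps carry no extra penalty for it.
```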
The penalty for the original action will take into account human reactions in the inaction rollout after this action, so the agent will prefer actions that result in humans changing fewer things in response. I’m not sure whether to consider this clinginess—if so, it might be useful to call it “ex ante clinginess” to distinguish from “ex post clinginess” (similar to your corresponding distinction for offsetting). The “ex ante” kind of clinginess is the same property that causes the agent to avoid scapegoating butterfly effects, so I think it’s a desirable property overall. Do you disagree?
I think it’s generally a good property when executed as a reasonable person would execute it. The problem, however, is the bad ex ante clinginess plans, where the agent has an incentive to pre-emptively constrain our reactions as hard as it can (and it could constrain them really hard).
The problem is lessened if the agent is agnostic to the specific details of the world, but like I said, it seems like we really need IV (or an improved successor to it) to cleanly cut off these perverse incentives.
I’m not sure I understand the connection to scapegoating for the agents we’re talking about; scapegoating is only permitted if credit assignment is explicitly part of the approach and there are privileged “agents” in the provided ontology.