it equally penalizes the agent for causing event A and for preventing event A
Well, there is some asymmetry due to approval incentives. It isn’t very clear to what extent we can rely on these at the moment (although I think they’re probably quite strong). Also, the agent is more inclined to have certain impacts, as presumably u_A is pointing (very) roughly in the right direction.
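To make the quoted symmetry concrete, here is a minimal sketch. It assumes the penalty is a sum of absolute changes in attainable utility relative to doing nothing, which is a simplification rather than the full AUP formulation. The function name `aup_penalty` and all numbers are hypothetical.

```python
def aup_penalty(q_action, q_noop):
    """Sum of absolute changes in attainable utility, one term per auxiliary utility.

    Simplified, hypothetical form of the penalty: it ignores scaling by the
    impact unit and anything else the full method does.
    """
    return sum(abs(qa - qn) for qa, qn in zip(q_action, q_noop))


# One hypothetical auxiliary utility that rewards event A occurring.
q_noop = [0.5]      # attainable utility if the agent does nothing
q_cause = [0.75]    # attainable utility after an action that makes A more likely
q_prevent = [0.25]  # attainable utility after an action that makes A less likely

penalty_cause = aup_penalty(q_cause, q_noop)
penalty_prevent = aup_penalty(q_prevent, q_noop)

# The absolute value discards the sign of the change, so both directions
# are penalized identically.
print(penalty_cause, penalty_prevent)
```

Any asymmetry therefore has to come from outside the penalty term, for example from approval incentives or from u_A itself.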
this greatly reduces the granularity of the penalty, making credit assignment more difficult.
I don’t think this seems too bad here: in effect, driving someone somewhere in a normal way is one kind of action, and normal AUP is too harsh. The question remains whether this is problematic in general. I lean towards no, due to the way the impact unit is calculated, but it deserves further consideration.
This effectively uses the initial-branch inaction baseline (branching off when the self-driving car is launched) instead of the stepwise inaction baseline, which means getting clinginess issues back, in the sense of the agent being penalized for human reactions to the self-driving car.
Intent verification does seem to preclude bad behavior here. As Rohin has pointed out, however, even if every problem we can think of seems to be caught by some other part of the method, the fact that these discrepancies arise should indeed give us pause.
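The clinginess concern in the quoted objection can be illustrated with a toy model. This is only a sketch under strong assumptions: the world state is a single integer, attainable utility equals that integer, and humans shift the state by one each step once the car has been launched. All names (`step`, `attainable_utility`) and numbers are hypothetical.

```python
def attainable_utility(world_state: int) -> float:
    # Toy assumption: attainable utility is just the world state.
    return float(world_state)


def step(world_state: int, agent_delta: int, humans_react: bool) -> int:
    # Humans change the world by +1 per step, but only if the car was launched.
    return world_state + agent_delta + (1 if humans_react else 0)


T = 3  # steps since launch

# Actual world: the car was launched, humans react, the agent only no-ops.
state_before_last = 0
for _ in range(T - 1):
    state_before_last = step(state_before_last, 0, humans_react=True)
state_after_noop = step(state_before_last, 0, humans_react=True)

# Stepwise inaction baseline: branch from the current state and no-op once.
stepwise_baseline = step(state_before_last, 0, humans_react=True)
stepwise_penalty = abs(
    attainable_utility(state_after_noop) - attainable_utility(stepwise_baseline)
)

# Initial-branch inaction baseline: branch at launch, so the car never runs
# and humans never react.
initial_baseline = 0
for _ in range(T):
    initial_baseline = step(initial_baseline, 0, humans_react=False)
initial_branch_penalty = abs(
    attainable_utility(state_after_noop) - attainable_utility(initial_baseline)
)

print(stepwise_penalty, initial_branch_penalty)
```

In this toy model the agent does nothing at every step, yet under the initial-branch baseline it is still charged for the accumulated human reactions, while the stepwise baseline charges it nothing.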
You may not be able to predict in advance when the agent will encounter situations where the default action is irreversible or otherwise undesirable.
We might have the agent just sitting in a lab, where the default action seems fine. The failure mode seems easy to avoid in general, although I could be wrong. I also have the intuition that any individual environment we would look at could be configured through incrementation such that the default action is fine.