Flo’s summary for the Alignment Newsletter:

<@Impact measures@>(@Measuring and avoiding side effects using relative reachability@) usually require a baseline state, relative to which we define impact. The choice of this baseline has important effects on the impact measure’s properties: for example, the popular stepwise inaction baseline (where at every step the effect of the current action is compared to doing nothing) does not generate incentives to interfere with environment processes or to offset the effects of the agent’s own actions. However, it ignores delayed impacts and provides no incentive to offset unwanted delayed effects once they are set in motion.
This points to a **tradeoff** between **penalizing delayed effects** (which is always desirable) and **avoiding offsetting incentives**, which is desirable if the effect to be offset is part of the objective and undesirable if it is not. We can circumvent the tradeoff by **modifying the task reward**: If the agent is only rewarded in states where the task remains solved, incentives to offset effects that contribute to solving the task are weakened. In that case, the initial inaction baseline (which compares the current state with the state that would have occurred if the agent had done nothing until now) deals better with delayed effects and correctly incentivizes offsetting for effects that are irrelevant for the task, while the incentives for offsetting task-relevant effects are balanced out by the task reward. If modifying the task reward is infeasible, similar properties can be achieved in the case of sparse rewards by setting the baseline to the last state in which a reward was achieved. To make the impact measure defined via the time-dependent initial inaction baseline **Markovian**, we could sample a single baseline state from the inaction rollout or compute a single penalty at the start of the episode, comparing the inaction rollout to a rollout of the agent policy.
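For concreteness, here is a minimal sketch (not from the post or the paper) of how the two baselines enter the penalty computation. `step_fn`, `deviation`, and `NOOP` are assumed stand-ins: `step_fn` is a deterministic environment model and `deviation` plays the role of a black-box deviation measure such as relative reachability.

```python
NOOP = 0  # hypothetical "do nothing" action


def stepwise_inaction_penalty(state, action, step_fn, deviation):
    """Penalty for one step: compare acting with doing nothing from `state`."""
    acted = step_fn(state, action)
    stayed = step_fn(state, NOOP)
    return deviation(acted, stayed)


def initial_inaction_penalty(initial_state, actions, step_fn, deviation):
    """Penalty over a trajectory: compare each visited state with the state the
    agent would have reached by doing nothing since the start of the episode
    (the initial inaction baseline)."""
    state, baseline = initial_state, initial_state
    total = 0.0
    for action in actions:
        state = step_fn(state, action)
        baseline = step_fn(baseline, NOOP)  # inaction rollout from t = 0
        total += deviation(state, baseline)
    return total


if __name__ == "__main__":
    # Toy deterministic world: the state is a number, an action adds its value.
    step_fn = lambda s, a: s + a
    deviation = lambda s, b: abs(s - b)  # stand-in for relative reachability

    # The effect of the first action persists; the stepwise baseline only sees
    # it at the step where it happens, while the initial inaction baseline
    # keeps penalizing it at every later step.
    print(stepwise_inaction_penalty(0, 2, step_fn, deviation))         # 2
    print(initial_inaction_penalty(0, [2, 0, 0], step_fn, deviation))  # 6
```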
Flo’s opinion:
I like the insight that offsetting is not always bad and the idea of dealing with the bad cases using the task reward. State-based reward functions that capture whether or not the task is currently done also intuitively seem like the correct way of specifying rewards in cases where achieving the task does not end the episode.
Looks great, thanks! Minor point: in the sparse reward case, rather than “setting the baseline to the last state in which a reward was achieved”, we set the initial state of the inaction baseline to be this last rewarded state, and then apply noops from this initial state to obtain the baseline state (otherwise this would be a starting state baseline rather than an inaction baseline).
Good point, changed “setting the baseline to the last state in which a reward was achieved” to “setting the initial state of the inaction baseline to the last rewarded state and applying noops from this state to obtain the baseline state”.
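Following the clarification in the comments above, here is a minimal sketch of the sparse-reward variant, in which the inaction baseline is re-initialized at the last rewarded state and noops are applied from there. As before, `step_fn`, `deviation`, `noop`, and the reward sequence are illustrative assumptions, not details from the post.

```python
def sparse_reward_inaction_penalty(initial_state, actions, rewards,
                                   step_fn, deviation, noop=0):
    """`rewards[t]` is the sparse task reward received after `actions[t]`.
    The baseline is an inaction rollout whose initial state is reset to the
    most recent rewarded state, rather than that rewarded state itself."""
    state, baseline = initial_state, initial_state
    total = 0.0
    for action, reward in zip(actions, rewards):
        state = step_fn(state, action)
        baseline = step_fn(baseline, noop)  # noops from the rollout's start
        total += deviation(state, baseline)
        if reward != 0:
            baseline = state  # restart the inaction rollout at the rewarded state
    return total


# Same toy stand-ins as in the sketch above: actions add to the state,
# noop = 0 leaves it unchanged, deviation is an absolute difference.
step_fn = lambda s, a: s + a
deviation = lambda s, b: abs(s - b)
# A reward after the first action restarts the baseline rollout there, so the
# persisting effect of that (task-relevant) action is no longer penalized.
print(sparse_reward_inaction_penalty(0, [2, 0, 1], [1, 0, 0], step_fn, deviation))  # 3
```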