Rohin Shah comments on AXRP Episode 7 - Side Effects with Victoria Krakovna

Rohin Shah 16 May 2021 21:59 UTC
LW: 4 AF: 3
AF
Planned summary for the Alignment Newsletter:
This podcast goes over the problem of side effects, and impact regularization as an approach to handle this problem. The core hope is that impact regularization would enable “minimalistic” value alignment, in which the AI system may not be doing exactly what we want, but at the very least it will not take high impact actions that could cause an existential catastrophe.
An impact regularization method typically consists of a _deviation measure_ and a _baseline_. The baseline is what we compare the agent to, in order to determine whether it had an “impact”. The deviation measure is used to quantify how much impact there has been, when comparing the state generated by the agent to the one generated by the baseline.
Deviation measures are relatively uncontroversial – there are several possible measures, but they all seem to do relatively similar things, and there aren’t any obviously bad outcomes traceable to problems with the deviation measure. However, that is not the case with baselines. One typical baseline is the **inaction** baseline, where you compare against what would have happened if the agent had done nothing. Unfortunately, this leads to _offsetting_: as a simple example, if some food was going to be thrown away and the agent rescues it, it then has an incentive to throw it away again, since that would minimize impact relative to the case where it had done nothing. A solution is the **stepwise inaction** baseline, which compares to the case where the agent does nothing starting from the previous state (instead of from the beginning of time). However, this then prevents some beneficial offsetting: for example, if the agent opens the door to leave the house, then the agent is incentivized to leave the door open.
As a result, the author is interested in seeing more work on baselines for impact regularization. In addition, she wants to see impact regularization tested in more realistic scenarios. That being said, she thinks that the useful aspect of impact regularization research so far is in bringing conceptual clarity to what we are trying to do with AI safety, and in identifying the interference and offsetting behaviors, and the incentives for them.