Rohin Shah comments on Arguments against myopic training

Rohin Shah 9 Jul 2020 19:51 UTC
LW: 5 AF: 4
AF
Planned summary for the Alignment Newsletter:
Several (AN #34) <@proposals@>(@An overviAN #34 t involve some form of myopic training, in which an AI system is trained to take actions that only maximize the feedback signal in the **next timestep** (rather than e.g. across an episode, or across all time, as with typical reward signals). In order for this to work, the feedback signal needs to take into account the future consequences of the AI system’s action, in order to incentivize good behavior, and so providing feedback becomes more challenging.
This post argues that there don’t seem to be any major benefits of myopic training, and so it is not worth the cost we pay in having to provide more challenging feedback. In particular, myopic training does not necessarily lead to “myopic cognition”, in which the agent doesn’t think about long-term consequences when choosing an action. To see this, consider the case where we know the ideal reward function R*. In that case, the best feedback to give for myopic training is the optimal Q-function Q*. However, regardless of whether we do regular training with R* or myopic training with Q*, the agent would do well if it estimates Q* in order to select the right action to take, which in turn will likely require reasoning about long-term consequences of its actions. So there doesn’t seem to be a strong reason to expect myopic training to lead to myopic cognition, if we give feedback that depends on (our predictions of) long-term consequences. In fact, for any approval feedback we may give, there is an equivalent reward feedback that would incentivize the same optimal policy.
Another argument for myopic training is that it prevents reward tampering and manipulation of the supervisor. The author doesn’t find this compelling. In the case of reward tampering, it seems that agents would not catastrophically tamper with their reward “by accident”, as tampering is difficult to do, and so they would only do so intentionally, in which case it is important for us to prevent those intentions from arising, for which we shouldn’t expect myopic training to help very much. In the case of manipulating the supervisor, he argues that in the case of myopic training, the supervisor will have to think about the future outputs of the agent in order to be competitive, which could lead to manipulation anyway.
Planned opinion:
I agree with what I see as the key point of this post: myopic training does not mean that the resulting agent will have myopic cognition. However, I don’t think this means myopic training is useless. According to me, the main benefit of myopic training is that small errors in reward specification for regular RL can incentivize catastrophic outcomes, while small errors in approval feedback for myopic RL are unlikely to incentivize catastrophic outcomes. (This is because “simple” rewards that we specify often lead to <@convergent instrumental subgoals@>(@The Basic AI Drives@), which need not be the case for approval feedback.) More details in this comment.