I found that the Recursive Reward Modeling paper discusses my concern in its Reward Gaming section. However, it is rather short on ideas for solving it:
Reward gaming: Opportunities for reward gaming arise when the reward function incorrectly provides high reward to some undesired behavior (Clark & Amodei, 2016; Lehman et al., 2018); see Figure 4 for a concrete example. One potential source for reward gaming is the reward model's vulnerability to adversarial inputs (Szegedy et al., 2013). If the environment is complex enough, the agent might figure out how to specifically craft these adversarially perturbed inputs in order to trick the reward model into providing higher reward than the user intends. Unlike in most work on generating adversarial examples (Goodfellow et al., 2015; Huang et al., 2017), the agent would not necessarily be free to synthesize any possible input to the reward model, but would need to find a way to realize adversarial observation sequences in its environment.

Reward gaming problems are in principle solvable by improving the reward model. Whether this means that reward gaming problems can also be overcome in practice is arguably one of the biggest open questions and possibly the greatest weakness of reward modeling. Yet there are a few examples from the literature indicating that reward gaming can be avoided in practice. Reinforcement learning from a learned reward function has been successful in gridworlds (Bahdanau et al., 2018), Atari games (Christiano et al., 2017; Ibarz et al., 2018), and continuous motor control tasks (Ho & Ermon, 2016; Christiano et al., 2017).
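To make the adversarial-input point in the excerpt concrete, here is my own toy sketch (not anything from the paper): given white-box access to a differentiable learned reward model, a bounded gradient ascent on the predicted reward stands in for an agent that learns to realize such observations in its environment. The model architecture, observation dimension, and perturbation budget below are all made up for illustration, and the reward model is left untrained just to show the mechanics.

```python
# Toy illustration (not from the paper) of the adversarial-input failure mode:
# perturbing an observation to inflate the reward *predicted* by a learned model,
# regardless of what the user actually wants. All names/numbers are made up.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical learned reward model: a small MLP over 8-dim observations.
reward_model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
for p in reward_model.parameters():
    p.requires_grad_(False)  # we only optimize the perturbation, not the model

def adversarial_perturbation(obs, steps=50, lr=0.1, budget=0.3):
    """Gradient-ascend the predicted reward within an L-inf budget, standing in
    for an agent that finds a way to realize such observations in its environment."""
    delta = torch.zeros_like(obs, requires_grad=True)
    opt = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -reward_model(obs + delta).mean()  # maximize predicted reward
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-budget, budget)  # keep the perturbation "small"/realizable
    return (obs + delta).detach()

obs = torch.randn(1, 8)
gamed = adversarial_perturbation(obs)
print("predicted reward before:", reward_model(obs).item())
print("predicted reward after: ", reward_model(gamed).item())
```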
It seems to me that the absence of reward gaming in practice is probably a result of the limited computation applied to RL (which acts as a kind of implicit mild optimization), so that result would likely not be robust to scaling up. In the limit of infinite computation, standard RL algorithms should end up training models that maximize reward, which means engaging in reward gaming if that is what maximizes reward. (Perhaps with current RL algorithms this isn't a concern in practice, because it would take an impractically large amount of computation to actually end up with reward-gaming / exploitative behavior, but it doesn't seem safe to assume that this will continue to hold despite future advances in RL technology.)
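Here is a caricature of that argument in code (again mine, purely illustrative): suppose the learned reward is a good proxy for the true reward everywhere except on a tiny mis-specified region. A weak optimizer (small search budget) almost never lands on the exploit and so looks well-behaved, while a stronger optimizer reliably finds and exploits the gap. All the numbers below are arbitrary.

```python
# Toy sketch of "more optimization finds the exploit": proxy reward tracks the
# true reward except on a tiny region where the proxy is huge and the true
# reward is low. Search budget stands in for RL compute.
import random

random.seed(0)

def true_reward(x):
    return 1.0 - abs(x - 0.5)          # what the user actually wants

def proxy_reward(x):
    if 0.9137 <= x <= 0.9140:          # tiny mis-specified region (the "exploit")
        return 100.0
    return true_reward(x)

def optimize(budget):
    """Stand-in for an RL algorithm: return the action with the best proxy
    reward found within a fixed search budget."""
    candidates = [random.random() for _ in range(budget)]
    return max(candidates, key=proxy_reward)

for budget in (10, 100, 100_000):
    x = optimize(budget)
    print(f"budget={budget:>7}: proxy={proxy_reward(x):6.2f}, true={true_reward(x):5.2f}")
```

With small budgets the proxy and true rewards agree, which looks like "reward gaming avoided in practice"; with a large enough budget the optimizer finds the exploit and the true reward collapses, which is the scaling worry.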
It currently feels to me like the problems with imitation learning might be less serious and easier to solve than the reward gaming problem with RL, so I'm still wondering about the move from SL/imitation to RL for IDA. Also, if there isn't existing discussion of this that I've missed, I'm surprised that this issue has received so little attention.