This post discusses an issue that could lead to catastrophically misaligned AI even when we have access to a perfect reward signal and there are no misaligned inner optimizers. Instead, the misalignment comes from the fact that our reward signal is too expensive to use directly for RL training, so we train a reward model, which is incorrect on some off-distribution transitions. The agent might then exploit these off-distribution deficiencies, which I’ll refer to as reward model hacking.
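To make the dynamic concrete, here is a minimal toy sketch (all names and numbers hypothetical, not anything from the post): the ground-truth signal depends on a feature we care about ("quality"), the reward model only sees a cheap proxy ("length") that correlates with quality on the training distribution, and the policy then pushes the proxy far off-distribution where the reward model's estimates come apart from the truth.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each "trajectory" has two features: quality, which the expensive
# ground-truth signal measures, and length, a cheap observable proxy.
# On the training distribution the two are strongly correlated.
n = 500
quality = rng.uniform(0.0, 1.0, size=n)                      # what we actually care about
length = 50 + 100 * quality + rng.normal(0.0, 5.0, size=n)   # correlated proxy

true_reward = quality  # the expensive signal, only queried at training time

# Reward model: least-squares fit from the observable feature (length) to
# the ground-truth labels. Accurate on-distribution, wrong off-distribution.
X = np.stack([np.ones(n), length], axis=1)
w, *_ = np.linalg.lstsq(X, true_reward, rcond=None)
reward_model = lambda length: w[0] + w[1] * length

# Stand-in for RL: the policy searches for whatever the reward model rates
# highest, including transitions far outside the training distribution
# (e.g. padding length without adding any quality).
candidate_lengths = np.linspace(50, 5000, 200)
hacked_length = candidate_lengths[np.argmax(reward_model(candidate_lengths))]

print(f"reward model's preferred length: {hacked_length:.0f}")
print(f"predicted reward there:          {reward_model(hacked_length):.2f}")
print("true reward is bounded by 1.0, so the policy is hacking the reward model")
```

The reward model here is fine on the data it was trained on; the failure only appears because the policy is optimizing against it in a region the labels never covered.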
Fwiw, I would say that in this case you had an inner alignment failure in your training of the reward model.
(Or alternatively, I would think of the policy + reward model as a unified AI system, and then say that you had an inner alignment failure w.r.t. the unified AI system.)
I’m not sure everyone would agree with this; I’ve found that different people mean different things by outer and inner alignment.