Suppose we condition on RLHF failing. At a high level, failures split into: (a) human labelers rewarded the wrong thing (e.g. fooling humans); (b) the reward model failed to predict human labelers' judgement and rewarded the wrong thing (e.g. reward hacking); (c) RL produced a policy that is capable enough to be dangerous but is optimizing something other than the reward model (e.g. mesa-optimization).
Minor point, feel free to ignore.
FWIW, I typically use ‘reward hacking’ to refer to just (a) here. I’d just call (b) ‘poor reward model sample efficiency’. That said, I more centrally use ‘reward hacking’ to describe hacking a reward process based on outcomes via stuff like ‘sensor tampering’, but this is still a subset of RLHF: the subset where humans look at outcomes and then assess reward taking this into account.
Oh, we’re using terminology quite differently then. I would not call (a) reward hacking, as I view the model as being the reward (to the RL process), whereas humans are not providing reward at all (but rather some data that gets fed into a reward model’s learning process). I don’t especially care about what definitions we use here, but do wonder if this means we’re speaking past each other in other areas as well.