ryan_greenblatt comments on AI Safety in a World of Vulnerable Machine Learning Systems

ryan_greenblatt 19 Jun 2023 4:31 UTC
LW: 1 AF: 1
0

Suppose we condition on RLHF failing. At a high level, failures split into: (a) human labelers rewarded the wrong thing (e.g. fooling humans); (b) the reward model failed to predict human labelers' judgement and rewarded the wrong thing (e.g. reward hacking); (c) RL produced a policy that is capable enough to be dangerous but is optimizing something other than the reward model (e.g. mesa-optimization).

Minor point, feel free to ignore.

FWIW, I typically use ‘reward hacking’ to refer to just (a) here. I’d just call (b) ‘poor reward model sample efficiency’. That said, I more centrally use ‘reward hacking’ to describe hacking a reward process based on outcomes via stuff like ‘sensor tampering’, but this is still a subset of RLHF: the subset where humans look at outcomes and then assess reward taking this into account.
- AdamGleave 20 Jun 2023 1:36 UTC
  LW: 1 AF: 1
  1
  Oh, we’re using terminology quite differently then. I would not call (a) reward hacking, as I view the model as being the reward (to the RL process), whereas humans are not providing reward at all (but rather some data that gets fed into a reward model’s learning process). I don’t especially care about what definitions we use here, but do wonder if this means we’re speaking past each other in other areas as well.