RLHF (which are very obviously not even trying to solve any of the problems which could plausibly kill us)
Sorry for being dumb, but I thought the naïve case for RLHF is that it helps solve the problem of “people are very bad at manually writing down an explicit utility or reward function that does what they intuitively want”? Does that not count as one of the lethal problems (even if RLHF alone would kill us because of the other problems)? If one of the other problems is Goodharting/unforeseen-maxima, it seems like RLHF could be helpful insofar as, if RLHF rewards are quantitatively less misaligned than hand-coded rewards, you can get away with optimizing them harder before they kill you?
That is a reasonable case, with the obvious catch that you don’t know how hard you can optimize before it goes wrong, and when it does go wrong you’re less likely to notice than with a hand-coded utility/reward.
But I expect the people who work on RLHF do not expect a misspecified explicit utility/reward to be the problem which actually kills us, because they’d expect visible failures before it gets to the capability level of killing us. RLHF makes those visible failures less likely. Under that frame, it’s the lack of a warning shot which kills us.
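A toy numerical sketch of the “how hard can you optimize the proxy” point, with entirely made-up numbers (my own illustration, not anything claimed in the thread): a true value plus two imperfect proxies, each correlated with the true value but with rare “exploit” actions it drastically overrates, where the learned proxy’s exploits are rarer than the hand-coded proxy’s. Best-of-n selection stands in for optimization pressure; the exploit_rate, exploit_bonus, and exploit_harm parameters are illustrative assumptions, not estimates of anything.

```python
# Toy sketch (illustrative, made-up numbers): two imperfect proxies for a true
# value. Each proxy is noisy but correlated with the true value, except for
# rare "exploit" actions it drastically overrates; the learned proxy's exploits
# are rarer than the hand-coded proxy's. Best-of-n stands in for optimization.
import numpy as np

rng = np.random.default_rng(0)

def one_trial(n, exploit_rate, exploit_bonus=5.0, exploit_harm=10.0):
    true_value = rng.normal(0.0, 1.0, size=n)           # what we actually want
    proxy = true_value + rng.normal(0.0, 0.5, size=n)   # noisy but correlated proxy
    exploit = rng.random(n) < exploit_rate               # rare actions the proxy overrates
    proxy = proxy + exploit_bonus * exploit
    true_value = true_value - exploit_harm * exploit     # exploits are genuinely bad
    best = np.argmax(proxy)                               # optimization pressure: best-of-n
    return true_value[best]

def mean_true_value(n, exploit_rate, trials=500):
    return float(np.mean([one_trial(n, exploit_rate) for _ in range(trials)]))

for n in (1, 10, 100, 1_000, 10_000):
    hand = mean_true_value(n, exploit_rate=0.05)       # crude proxy: exploits easy to find
    learned = mean_true_value(n, exploit_rate=0.0005)  # better proxy: exploits much rarer
    print(f"n={n:>6}  true value under hand-coded-ish proxy {hand:+.2f}, "
          f"under learned-ish proxy {learned:+.2f}")
```

The qualitative pattern is the point: more optimization pressure helps for a while, then abruptly hurts, and the break comes later for the less-misaligned proxy. Where exactly it breaks depends on the made-up exploit_rate, which is precisely the thing you don’t get to observe in advance.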
when it does go wrong you’re less likely to notice than with a hand-coded utility/reward [...] RLHF makes those visible failures less likely
Because it incentivizes learning human models which can then be used to deceive more competently, or just because once you’ve fixed the problems you know how to notice, what’s left are the ones you don’t know how to notice? The latter doesn’t seem specific to RLHF (you’d have the same problem if people magically got better at hand-coding rewards), but I see how the former is plausible and bad.
The problem isn’t just learning whole human models. RLHF will select for any heuristic/strategy which, even by accident, hides bad behavior from humans. That selection pressure applies even at low capabilities.
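To make the “even by accident, even at low capabilities” point concrete, here is a minimal toy I made up (invented behaviours and payoffs, nothing from the thread): three behaviours, one of which happens to leave its mess where the rater can’t see it. The rater scores only what they observe, and a plain REINFORCE-style update on that feedback converges on the mess-hiding behaviour without any modelling of the human at all.

```python
# Minimal toy (invented behaviours and payoffs): human feedback is computed
# from what the rater can observe, so a behaviour that happens to leave its
# mess out of view outscores honest-but-slower work, and a simple policy
# update on that feedback selects for it.
import numpy as np

rng = np.random.default_rng(0)

#                       (task_done_fraction, mess_made, mess_visible_to_rater)
BEHAVIOURS = {
    "careful_partial":  (0.5, False, False),  # slower, honest work
    "sloppy_visible":   (1.0, True,  True),   # fast, mess in plain sight
    "sloppy_hidden":    (1.0, True,  False),  # fast, mess happens to land off-camera
}
NAMES = list(BEHAVIOURS)

def human_feedback(done, mess, visible):
    # The rater scores only what they can see: completion minus visible mess.
    return done - (1.0 if visible else 0.0)

def true_value(done, mess, visible):
    # What we actually care about: the mess counts whether or not it's seen.
    return done - (1.0 if mess else 0.0)

# RLHF-flavoured selection: softmax policy over behaviours, REINFORCE update
# on human feedback with the expected feedback as a baseline.
logits = np.zeros(len(NAMES))
for _ in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()
    i = rng.choice(len(NAMES), p=probs)
    reward = human_feedback(*BEHAVIOURS[NAMES[i]])
    baseline = sum(p * human_feedback(*BEHAVIOURS[n]) for p, n in zip(probs, NAMES))
    grad = -probs
    grad[i] += 1.0
    logits += 0.1 * (reward - baseline) * grad

probs = np.exp(logits) / np.exp(logits).sum()
for name, p in zip(NAMES, probs):
    done, mess, visible = BEHAVIOURS[name]
    print(f"{name:>16}: policy prob {p:.2f}, "
          f"human feedback {human_feedback(done, mess, visible):+.1f}, "
          f"true value {true_value(done, mess, visible):+.1f}")
```

The policy ends up concentrated on sloppy_hidden even though its true value is no better than the behaviour the rater punishes; nothing in the setup requires a model of the rater, only that the feedback signal is computed from what the rater observes.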