I think your example is too simple to capture the relevant phenomenon. Here’s one which does: suppose state s3 gives high reward, state s4 gives medium reward, and state s5 gives low reward. You’ve seen the following trajectories:
s2 → s3
s1 → s4
s1 → s2 → s5
Then Q-learning will quickly learn that it should go s1 → s2 → s3, by stitching together the observed s1 → s2 and s2 → s3 transitions, whereas REINFORCE and SFT will need to do further exploration before learning that.
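To make the stitching concrete, here is a minimal tabular Q-learning sketch of this toy MDP, replaying just the three observed trajectories as an offline batch. The reward values (10 / 5 / 1 for high / medium / low) and the action names are assumptions made for illustration, not part of the example above.

```python
from collections import defaultdict

# Observed transitions from the three trajectories, replayed as an offline batch:
# (state, action, reward, next_state, done). Rewards 10/5/1 stand in for
# high/medium/low; reaching s3, s4 or s5 ends the episode.
batch = [
    ("s2", "to_s3", 10.0, "s3", True),   # from trajectory s2 -> s3
    ("s1", "to_s4", 5.0,  "s4", True),   # from trajectory s1 -> s4
    ("s1", "to_s2", 0.0,  "s2", False),  # from trajectory s1 -> s2 -> s5
    ("s2", "to_s5", 1.0,  "s5", True),
]

Q = defaultdict(float)        # Q[(state, action)], initialised to 0
alpha, gamma = 0.5, 0.9

def known_actions(state):
    # Actions we have a Q-value for in this state (i.e. ones seen in the batch).
    return [a for (s, a) in list(Q) if s == state]

for _ in range(100):          # sweep the batch until the values converge
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * max(Q[(s_next, b)] for b in known_actions(s_next))
        Q[(s, a)] += alpha * (r + bootstrap - Q[(s, a)])

# Greedy policy: Q[(s1, to_s2)] converges to ~0.9 * 10 = 9 > 5 = Q[(s1, to_s4)],
# even though the full trajectory s1 -> s2 -> s3 was never observed.
print(max(known_actions("s1"), key=lambda a: Q[("s1", a)]))  # -> to_s2
print(max(known_actions("s2"), key=lambda a: Q[("s2", a)]))  # -> to_s3
```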
I feel uncertain about how to think about the implications of this claim in the context of more complex environments, though. In some sense it only happens because Q-learning is doing a one-step lookahead, which isn’t really scalable. (Though not all critics are limited to a one-step lookahead.)
It feels like I might have just come up with a new name for “RL algorithms which work on offline data”, which is presumably not a crucial distinction.
Ah, nice example! I now see your point, and I agree with everything you wrote. Whereas REINFORCE and SFT only incentivize actions which in fact were historically part of high-reward trajectories, Q-learning and actor-critic methods incentivize actions which make up trajectories that one can infer would be high-reward (even if those actions never actually appeared in high-reward trajectories previously).
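And to illustrate the other half of that contrast in the same toy setting (same assumed 10 / 5 / 1 rewards, with a simple best-observed-trajectory filter standing in for the SFT/REINFORCE selection step, which is a simplification): the only trajectory from s1 that ever gets upweighted is the observed s1 → s4 one, so the stitched route through s2 is never incentivized.

```python
# Same toy data as above: (states, actions, return) for each observed trajectory.
observed = [
    (["s2", "s3"], ["to_s3"], 10.0),
    (["s1", "s4"], ["to_s4"], 5.0),
    (["s1", "s2", "s5"], ["to_s2", "to_s5"], 1.0),
]

# A crude stand-in for SFT on high-reward data: imitate the first action of the
# highest-return observed trajectory that starts in s1. (REINFORCE would also
# weakly upweight to_s2 via the low-return trajectory, but never the unseen
# s1 -> s2 -> s3 route.)
from_s1 = [traj for traj in observed if traj[0][0] == "s1"]
best_states, best_actions, best_return = max(from_s1, key=lambda traj: traj[2])
print(best_actions[0])  # -> to_s4
```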