Ah, nice example! I now see your point, and I agree with everything you wrote. Whereas REINFORCE and SFT only incentivize actions that actually appeared in high-reward trajectories, Q-learning and actor-critic incentivize actions that make up trajectories one can infer would be high-reward, even if those actions never appeared in any high-reward trajectory in the data.
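To make that concrete, here's a minimal sketch on a hypothetical three-state toy MDP (not from our discussion) illustrating "trajectory stitching": the logged data contains one trajectory that takes action `a` at `s0` and earns zero return, and a separate trajectory that earns reward from `s1`. A Monte-Carlo/REINFORCE-style weighting never upweights `a` at `s0`, while tabular Q-learning bootstraps across the two trajectories and infers that `a` at `s0` leads to reward.

```python
# Two logged trajectories over states s0, s1, s2 (s2 terminal; reward 1 on entry).
# Transition format: (state, action, reward, next_state, done).
traj1 = [("s0", "a", 0.0, "s1", False)]  # ends before reaching any reward -> return 0
traj2 = [("s1", "a", 1.0, "s2", True)]   # earns the reward -> return 1
transitions = traj1 + traj2

# Monte-Carlo / REINFORCE-style credit: weight each (s, a) by the return of the
# trajectory it actually appeared in.
mc_weight = {}
for traj in (traj1, traj2):
    ret = sum(r for (_, _, r, _, _) in traj)
    for (s, a, _, _, _) in traj:
        mc_weight[(s, a)] = mc_weight.get((s, a), 0.0) + ret

# Tabular Q-learning: bootstrap each update off max_a' Q(s', a'), regardless of
# which trajectory the transition came from -- this "stitches" traj1 onto traj2.
q = {}
gamma, alpha = 0.9, 0.5
for _ in range(200):  # sweep the offline dataset repeatedly until convergence
    for (s, a, r, s2, done) in transitions:
        bootstrap = 0.0 if done else gamma * max(q.get((s2, a2), 0.0) for a2 in ("a",))
        q[(s, a)] = q.get((s, a), 0.0) + alpha * (r + bootstrap - q.get((s, a), 0.0))

print(mc_weight[("s0", "a")])    # 0.0 -> REINFORCE-style never reinforces a at s0
print(round(q[("s0", "a")], 3))  # 0.9 -> Q-learning infers a at s0 leads to reward
```

The key point is that `Q(s0, a)` ends up high even though the only trajectory containing `(s0, a)` had zero return; the value is inferred from the `(s1, a)` transition in a different trajectory.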