I’m a bit confused by the claim here, although I’ve only read the abstract and skimmed the paper, so perhaps it would become obvious on a closer read. As far as I can tell, the cited paper focuses on motion planning and considers a rather restricted setting of LQR policies.
I originally linked to the wrong paper! : (
Here is the actual Direct Preference Optimization paper. (I guess I just googled something like ‘DPO RL’ and then didn’t actually check that it was the right paper.) Yikes, sorry for wasting your time.
Ah, that paper makes a lot more sense. A reward model was attractive in the original Deep RL From Human Preferences paper because the environment was complex and non-differentiable: using RL was a natural fit. It’s always seemed a bit stranger to use RL for fine-tuning language models, especially in the prompt-completion setting where the “environment” is trivial. (RL becomes more natural when you start introducing external tools, or conversations with humans.)
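If I’m reading the paper right, the loss they end up with really is just a supervised objective over preference pairs: take the log-probability of the preferred and dispreferred completions under the policy and under a frozen reference model, and push up the gap between the two log-ratios. A toy sketch of that idea (my own code, with made-up numbers standing in for per-completion log-prob sums, not anything from their implementation):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss: a plain supervised objective on log-prob ratios.

    No sampling from the policy, no reward model, no RL loop: just the
    log-probabilities of the preferred (chosen) and dispreferred (rejected)
    completions under the current policy and a frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref on y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref on y_l
    # Maximize the margin between the two log-ratios; beta controls how hard
    # the policy is pushed away from the reference model.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage: made-up summed log-probs for a batch of three preference pairs.
policy_chosen = torch.tensor([-12.0, -8.5, -20.1], requires_grad=True)
policy_rejected = torch.tensor([-11.0, -9.0, -19.5], requires_grad=True)
ref_chosen = torch.tensor([-12.5, -8.7, -20.0])
ref_rejected = torch.tensor([-11.2, -8.9, -19.8])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()  # gradients flow straight into the policy log-probs
print(float(loss))
```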
I’ll need to take a closer look at the paper, but it looks like they derive the DPO objective by starting from the KL-regularized RL objective. So if it does what it says on the tin, I’d expect the resulting policy incentives to be similar. My hunch is that the problem of reward hacking has shifted from being explicit to being implicit rather than being eliminated, although I’m certainly not confident about this. Could be interesting to study with an approach similar to the one in the Scaling Laws for Reward Model Overoptimization paper.
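For reference, here’s the derivation as I currently understand it (using the paper’s notation, so treat this as a sketch of my reading rather than a faithful restatement). They start from the usual KL-regularized objective against a frozen reference policy:

$$\max_{\pi_\theta}\;\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big].$$

This has a closed-form optimum $\pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\exp\big(r(x, y)/\beta\big)$, which can be inverted to write the reward as $r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$. Substituting that into the Bradley-Terry preference model makes the partition function $Z(x)$ cancel, leaving a loss defined directly on the policy:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right].$$

So the reward is implicitly the scaled log-ratio against the reference model, which is why I’d expect the optimization pressure on the preference data to look similar, just without an explicit reward model to probe.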