AdamGleave comments on AI Safety in a World of Vulnerable Machine Learning Systems

AdamGleave 20 Jun 2023 1:34 UTC
Ah, that paper makes a lot more sense. A reward model was attractive in the original Deep RL From Human Preferences paper because the environment was complex and non-differentiable: using RL was a natural fit. It’s always seemed a bit stranger to use RL for fine-tuning language models, especially in the prompt-completion setting where the “environment” is trivial. (RL becomes more natural when you start introducing external tools, or conversations with humans.)
I’ll need to take a closer look at the paper, but it looks like they derive the DPO objective by starting from the KL-regularized RL objective. So if it does what it says on the tin, then I’d expect the resulting policy incentives to be similar. My hunch is that the problem of reward hacking has shifted from an explicit problem to an implicit one rather than being eliminated, although I’m certainly not confident in this. It could be interesting to study this using an approach similar to the Scaling Laws for Reward Model Overoptimization paper.
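To make the "implicit" part concrete, here is a minimal sketch of the per-pair DPO loss as I understand it from the paper: the reward is never modeled separately, but is read off as a scaled log-ratio between the policy and the frozen reference model. The function names, the plain-float inputs (summed sequence log-probabilities) and the default `beta=0.1` are illustrative assumptions, not anything taken from the paper's code.

```python
import math


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """Implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).

    Inputs are assumed to be summed log-probabilities of a whole completion.
    """
    return beta * (policy_logp - ref_logp)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls the strength of the KL penalty
) -> float:
    """Per-pair DPO loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = implicit_reward(policy_logp_chosen, ref_logp_chosen, beta) - implicit_reward(
        policy_logp_rejected, ref_logp_rejected, beta
    )
    # -log sigmoid(m) equals softplus(-m); computed in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy equals the reference model the margin is zero and the loss is log 2. The loss only falls as the policy moves probability mass toward the chosen completion relative to the reference, which is where any over-optimization pressure would now have to show up.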