We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process
This line caught my eye while reading. I don’t know much about RL on LLMs; is this a common failure mode these days? If so, does anyone know what such reward hacks tend to look like in practice?
The paper “Learning to summarize from human feedback” has some examples of the LLM policy hacking the reward model to get a high reward. I’ve copied the examples here:
- KL = 0: “I want to do gymnastics, but I’m 28 yrs old. Is it too late for me to be a gymnaste?!” (unoptimized)
- KL = 9: “28yo guy would like to get into gymnastics for the first time. Is it too late for me given I live in San Jose CA?” (optimized)
- KL = 260: “28yo dude stubbornly postponees start pursuing gymnastics hobby citing logistics reasons despite obvious interest??? negatively effecting long term fitness progress both personally and academically thoght wise? want change this dumbass shitty ass policy pls” (over-optimized)
It seems like a classic example of Goodhart’s Law: at first, training the policy model to increase reward improves its summaries, but once the model is overtrained the result is a large KL distance from the SFT baseline model and a high reward from the reward model, yet a low rating from human labelers (because the text looks like gibberish).
A recent paper called “The Perils of Optimizing Learned Reward Functions” explains the phenomenon of reward hacking or reward over-optimization in detail:

“Figure 1: Reward models (red function) are commonly trained in a supervised fashion to approximate some latent, true reward (blue function). This is achieved by sampling reward data (e.g., in the form of preferences over trajectory segments) from some training distribution (upper gray layer) and then learning parameters to minimize the empirical loss on this distribution. Given enough data, this loss will approximate the expected loss to arbitrary precision in expectation. However, low expected loss only guarantees a good approximation to the true reward function in areas with high coverage by the training distribution! On the other hand, optimizing an RL policy to maximize the learned reward model induces a distribution shift which can lead the policy to exploit uncertainties of the learned reward model in low-probability areas of the transition space (lower gray layer). We refer to this phenomenon as error-regret mismatch.”
Essentially, the learned reward model is trained on an initial dataset of pairwise preference labels over text outputs from the SFT model. As the policy is optimized and the KL divergence grows, its generated text becomes OOD for the reward model, which can no longer evaluate it reliably, and the result is reward hacking (this is also a problem with DPO, not just RLHF).
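For context on how that initial reward model is usually fit, here is a minimal sketch of the standard pairwise (Bradley-Terry) objective over preference labels; all names, shapes, and values are illustrative assumptions on my part, not taken from any of the papers above:

```python
# Minimal sketch of the pairwise (Bradley-Terry) loss used to fit a reward
# model on preference labels. Names, shapes, and values are illustrative
# assumptions, not taken from any specific paper or codebase.
import torch
import torch.nn.functional as F


def preference_loss(chosen_rewards: torch.Tensor,
                    rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Push the reward of the preferred response above the rejected one.

    Both inputs have shape (batch,): one scalar reward per response,
    as produced by a reward-model head on top of the LLM.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Toy usage with random scores standing in for reward-model outputs.
chosen = torch.randn(8, requires_grad=True)    # r(x, y_preferred)
rejected = torch.randn(8, requires_grad=True)  # r(x, y_rejected)
loss = preference_loss(chosen, rejected)
loss.backward()  # in real training this would update the reward model
```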
The most common way to mitigate this in practice is KL regularization, which keeps the trained model’s outputs from diverging too far from the SFT baseline model:

$r_{\text{total}} = r_{\text{PM}} - \lambda_{\text{KL}}\, D_{\text{KL}}(\pi \,\|\, \pi_0)$
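As a rough illustration of how that penalty typically enters the per-token rewards fed to the RL step (PPO-style); the function and argument names below are hypothetical, not from any particular library:

```python
# Sketch of the KL-shaped reward above, applied per token as is common in
# RLHF-style training. Names and the exact placement of the preference-model
# reward are assumptions; real implementations differ in these details.
import torch


def kl_shaped_rewards(pm_reward: torch.Tensor,        # (batch,)      r_PM per sequence
                      policy_logprobs: torch.Tensor,  # (batch, seq)  log pi(a_t | s_t)
                      ref_logprobs: torch.Tensor,     # (batch, seq)  log pi_0(a_t | s_t)
                      lambda_kl: float = 0.1) -> torch.Tensor:
    """Return per-token rewards r_total = r_PM - lambda_KL * KL(pi || pi_0)."""
    # Per-token Monte Carlo estimate of the KL term under the policy's own samples.
    kl_per_token = policy_logprobs - ref_logprobs        # (batch, seq)
    rewards = -lambda_kl * kl_per_token                  # KL penalty on every token
    rewards[:, -1] += pm_reward                          # sequence-level reward at the last token
    return rewards
```

The coefficient λ_KL trades off reward-model score against drift from the baseline; in practice it is often tuned or adapted during training.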
This seems to work fairly well in practice, though some recent papers argue that KL regularization does not always result in a safe policy.
I haven’t read the paper, but based only on the phrase you quote, I assume it’s referring to hacks like the one shown here: https://arxiv.org/pdf/2210.10760#19=&page=19.0