Trying to read the paper makes it painfully clear that I am banging against the limits of my technical chops. It feels like important stuff to know and get right, but as often happens I run into that point where my ability to wing it hits a wall and suddenly it turns into Greek.
My understanding of that whole story (corrections welcome!):
The starting problem is that we want to chisel human preferences into our LLM. We assemble a huge dataset of human-labeled data, where each entry consists of a prompt, two sampled LLM responses to that prompt, and a binary human label indicating which of the two responses the labeler prefers.
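Concretely, I picture each entry of that dataset as a record like this (a minimal sketch; the field names are mine, not from any particular paper):

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    response_a: str     # one sampled LLM response to the prompt
    response_b: str     # another sampled LLM response to the same prompt
    a_preferred: bool   # the human's binary label: True if they liked A more than B
```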
However, that dataset is obviously finite, and likely very painfully finite, since employing humans to label data is expensive and time-consuming. So here’s an idea: instead of fine-tuning our LLM on this dataset directly, we train a reward function based on this dataset. The hope is that this reward function will pick up on some underlying pattern in the human preferences, which would let us evaluate human preferences on arbitrary examples (squeezing more RL/fine-tuning steps out of the dataset), and hopefully even generalize human preferences out-of-distribution.
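For concreteness, the usual way such a reward model gets trained is a Bradley-Terry-style pairwise loss over the preference pairs; a minimal PyTorch-flavored sketch of that step (function name and shapes are my own):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss for the reward model.

    r_chosen / r_rejected: scalar rewards the model assigns to the human-preferred
    and dispreferred responses for the same prompt (shape: [batch]).
    """
    # -log sigmoid(r_w - r_l): driven down by making the chosen response's reward
    # exceed the rejected one's; the implied preference probability is sigmoid(r_w - r_l)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```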
The DPO paper basically shows that this doesn’t work. As it turns out, the RLHF’d LLM whose policy is optimized against the learned reward function ends up with the same policy we’d get if we’d just chiseled the initial dataset into our LLM directly. The whole “train a reward function” thing does nothing. And indeed: if there’s some underlying pattern to pick up on, it’d stand to reason that the LLM itself would pick up on it over the course of fine-tuning.[1][2]
However, here we run into some tricky problems with the technical implementation.
The intuitively natural way to chisel the human-preferences dataset into the LLM is to make up a loss consisting of two terms: a human-preference-logit term (i.e., take the dataset-defined probability p that the human would prefer response A to response B in a given case, and plug it into ln(p / (1 − p))), and the KL divergence of the fine-tuned LLM’s output distribution from the base LLM’s (i.e., penalize straying too far from accurate text prediction).
The idea is that the first term would move the LLM towards satisfying human preferences, and the second term would prevent it from overfitting/being lobotomized into just rote repetition of the human-preferences dataset.
However, there’s a problem with the logit function: it blows up as p approaches 1, heading off to infinity (see the graph). Which means that on those values, it hopelessly dominates the KL-divergence term. And our dataset consists of binary labels, and human preferences are often deterministic (always prefer A to B, with probability ~1), so there’s a ton of near-1 p’s in there. Which means the “don’t get lobotomized” term gets ignored, and DPO lobotomizes the LLM even more than plain RLHF.
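For reference, here’s the DPO loss as it’s usually written, to make the “explosion on hard labels” point concrete; a sketch with my own variable names (log-probs are summed over the response’s tokens):

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a batch of preference pairs.

    logp_*: log-probability of each response under the model being fine-tuned;
    ref_logp_*: the same under the frozen base (reference) model;
    beta: weight of the implicit KL penalty.
    """
    # how much the fine-tuned model has moved toward each response, relative to the base model
    chosen_shift = logp_chosen - ref_logp_chosen
    rejected_shift = logp_rejected - ref_logp_rejected
    # with hard (p = 1) labels, this keeps pushing the margin toward infinity:
    # -log sigmoid(beta * margin) only bottoms out as the margin -> +inf
    return -F.logsigmoid(beta * (chosen_shift - rejected_shift)).mean()
```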
This, ironically, is where the “useless” reward-function step actually helps. The reward model underfits, i.e. “softens” the preferences, producing fewer p ≈ 1 cases and therefore lobotomizing the LLM less. A textbook case of mistakes accidentally canceling each other out.
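One way to see the “softening”: the reward model’s implied preference probability is a sigmoid of the reward gap, which stays strictly below 1 for any finite gap, so the targets the policy ends up chasing are never quite the hard p = 1 labels. A tiny illustrative check:

```python
import math

# implied preference probability under a Bradley-Terry reward model: sigmoid(r_w - r_l)
for gap in [1.0, 2.0, 5.0, 10.0]:
    p = 1.0 / (1.0 + math.exp(-gap))
    print(f"reward gap {gap:>4}: implied p = {p:.6f}")  # approaches 1 but never reaches it
```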
The General Theoretical Paradigm paper points all of the above out, and provides a less-naive way to score LLMs on human-preference satisfaction which avoids the logit blow-up. (Or, rather, it finds a place in the equations where we can wedge in a regularization term such that we actually control the degree of lobotomization taking place – rather than setting it to the max (DPO) or kind of accidentally softening it (RLHF).)
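If I’m reading it right, the loss that paper arrives at (IPO) swaps the log-sigmoid for a squared regression toward a fixed target, so the optimum is a finite margin rather than an infinite one; a sketch, using the same variable conventions as the DPO snippet above (tau is the regularization knob):

```python
def ipo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, tau=0.1):
    """IPO loss: regress the log-ratio margin toward 1/(2*tau) instead of pushing it to infinity."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # squared loss is minimized at a finite margin, so hard (p = 1) labels can't blow it up
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()
```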
Overall? Unless I dramatically misunderstood something, wow am I not impressed with this whole thing.
[1] Well, more precisely, that the fine-tuning training loop would itself chisel said pattern into the LLM, no reward-function proxy needed.
[2] More technically: Suppose we have two reward functions r1(x,y) and r2(x,y), where y is the variable we’re varying in order to maximize them, and x is the set of all the other variables at play. If r1(x,y) − r2(x,y) = f(x), i.e. if the difference between these reward functions doesn’t depend on the variable we’re optimizing, then these reward functions are equivalent: if optimized-for, they’ll produce the same optimal value for y.
And then the paper shows that the fully-expanded reward function of RLHF is equivalent, in this sense, to the reward function of DPO/just chiseling-in the human-preferences dataset directly.
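To make that equivalence claim concrete: under the KL-regularized objective, the optimal policy is a softmax of the reward against the reference model, and any purely-x-dependent shift f(x) cancels in the normalization. A tiny numerical check of that invariance (the closed form pi*(y|x) ∝ pi_ref(y|x)·exp(r(x,y)/beta) is standard; the toy numbers are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.1
pi_ref = rng.dirichlet(np.ones(5))   # reference distribution over 5 candidate responses y
r1 = rng.normal(size=5)              # some reward r1(x, y) over those responses
r2 = r1 + 3.7                        # r2 = r1 + f(x): shifted by an amount that doesn't depend on y

def optimal_policy(r):
    # KL-regularized optimum: reweight the reference model by exp(reward / beta), then renormalize
    w = pi_ref * np.exp(r / beta)
    return w / w.sum()

print(np.allclose(optimal_policy(r1), optimal_policy(r2)))  # True: the f(x) shift cancels out
```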