The Direct Preference Optimization (DPO) paper promises a simpler and more efficient alternative to proximal policy optimization (PPO) that avoids the reward modeling phase and thus optimizes directly for the preferences expressed in preference data. This is achieved through the loss function:

$L_{DPO}(π_{θ}; π_{ref}) = -E_{(x, y_{w}, y_{l}) \sim D} \left[\log σ\left(β \log \frac{π_{θ}(y_{w} | x)}{π_{ref}(y_{w} | x)} - β \log \frac{π_{θ}(y_{l} | x)}{π_{ref}(y_{l} | x)}\right)\right]$

Where:

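$x$ is some prompt.
$π_{θ}(y_{w} | x)$ and $π_{θ}(y_{l} | x)$ are the probabilities of the preferred and dispreferred completions under the current model.
$π_{ref}(y_{w} | x)$ and $π_{ref}(y_{l} | x)$ are the same probabilities under the reference policy $π_{ref}$.
$σ$ is the logistic (sigmoid) function.
$β$ controls the deviation from the reference policy $π_{ref}$.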

In essence, DPO computes the log probabilities of preferred and dispreferred completions under the current model and optimizes parameters to increase the likelihood of the preferred completions and decrease the likelihood of the dispreferred completions.
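As a rough illustration of the loss above, here is a minimal sketch in plain Python. It assumes the sequence log-probabilities $\log π(y | x)$ have already been computed for each completion under both the current model and the reference policy. The function names (`log_sigmoid`, `dpo_loss`) and the default `beta=0.1` are illustrative choices, not something taken from the paper's code.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # log sigma(z) = -log(1 + exp(-z)); branch on the sign to avoid overflow.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Mean DPO loss over a batch of (x, y_w, y_l) triples.

    Each argument is a list of sequence log-probabilities log pi(y | x):
    policy_* under the current model, ref_* under the reference policy,
    *_w for the preferred completion and *_l for the dispreferred one.
    beta=0.1 is an arbitrary illustrative default (assumption).
    """
    losses = []
    for pw, pl, rw, rl in zip(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l):
        # beta * (log-ratio of preferred - log-ratio of dispreferred)
        margin = beta * ((pw - rw) - (pl - rl))
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)
```

When the current model matches the reference policy, the margin is zero and the loss is $\log 2$. Raising the log-probability of the preferred completion relative to the reference (or lowering that of the dispreferred one) pushes the loss below that value.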

The authors share the following results:

“Figure 2: **Left.** The frontier of expected reward vs KL to the reference policy. DPO provides the highest expected reward for all KL values, demonstrating the quality of the optimization. **Right.** TL;DR summarization win rates vs. human-written summaries, using GPT-4 as evaluator. DPO exceeds PPO’s best-case performance on summarization, while being more robust to changes in the sampling temperature.”

“Figure 3: **Left.** Win rates computed by GPT-4 for Anthropic-HH one-step dialogue; DPO is the only method that improves over chosen summaries in the Anthropic-HH test set. **Right.** Win rates for different sampling temperatures over the course of training. DPO’s improvement over the dataset labels is fairly stable over the course of training for different sampling temperatures.”

“We evaluate different methods by sampling completions on the test split of TL;DR summarization dataset, and computing the average win rate against reference completions in the test set. The completions for all methods are sampled at temperatures varying from 0.0 to 1.0, and the win rates are shown in Figure 2 (right). DPO, PPO and Preferred-FT all fine-tune the same GPT-J SFT model. We find that DPO has a win rate of approximately 61% at a temperature of 0.0, exceeding the performance of PPO at 57% at its optimal sampling temperature of 0.0. DPO also achieves a higher maximum win rate compared to the best of N baseline. We note that we did not meaningfully tune DPO’s β hyperparameter, so these results may underestimate DPO’s potential. Moreover, we find DPO to be much more robust to the sampling temperature than PPO, the performance of which can degrade to that of the base GPT-J model at high temperatures.”
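For readers unfamiliar with the metric, the win rate described here is just the fraction of test prompts on which the judge prefers the model's sample over the reference completion, computed separately for each sampling temperature. A minimal sketch, assuming the judge's verdicts are already available as `(temperature, won)` pairs (this input format and the function name are hypothetical, not from the paper):

```python
from collections import defaultdict


def win_rates_by_temperature(judgments):
    """Return {temperature: win rate} from (temperature, won) pairs.

    `won` is True when the judge preferred the model's sample over the
    reference completion for that prompt (hypothetical input format).
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for temperature, won in judgments:
        totals[temperature] += 1
        wins[temperature] += int(won)
    return {t: wins[t] / totals[t] for t in sorted(totals)}


# Made-up verdicts purely for illustration; not results from the paper.
example = [(0.0, True), (0.0, True), (0.0, False), (1.0, False), (1.0, True)]
rates = win_rates_by_temperature(example)
```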

“Comparing human and GPT-4 win rates and per-judgment agreement on TL;DR summarization samples. Humans agree with GPT-4 about as much as they agree with each other. Each experiment compares a summary from the stated method with a summary from PPO with temperature 0.”

I have created a prediction market to forecast the likelihood that DPO is adopted by a frontier lab before Jan 1, 2025.

Direct Preference Optimization in One Minute

lukemarks, 26 Jun 2023 11:52 UTC


Thomas Kwa 18 Aug 2023 0:35 UTC
DPO seems like a step towards better and more fine-grained control over models than RLHF, because it removes the possibility that the reward model underfits.
- LawrenceC 18 Aug 2023 1:17 UTC
  I suspect the underfitting explanation is probably a lot of what’s going on given the small models used by the authors. But in the case of larger, more capable models, why would you expect it to be underfitting instead of generalization (properly fitting)?
  - Thomas Kwa 18 Aug 2023 1:56 UTC
    Maybe the reward models are expressive enough to capture all patterns in human preferences, but it seems nice to get rid of this assumption if we can. Scaling laws suggest that larger models perform better (in the Gao paper there is a gap between the 3B and 6B reward models), so it seems reasonable that even the current largest reward models are not optimal.
    I guess it hasn’t been tested whether DPO scales better than RLHF. I don’t have enough experience with these techniques to have a view on whether it does.