Thomas Kwa comments on Direct Preference Optimization in One Minute

Thomas Kwa 18 Aug 2023 0:35 UTC
LW: 2 AF: 1
DPO seems like a step towards better and more fine-grained control over models than RLHF, because it removes the possibility that the reward model underfits.
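For readers who want the mechanics behind this point, here is a minimal sketch of the per-pair DPO loss; it is not part of the original comment. DPO has no separately trained reward model to underfit: the "reward" is implicit in the log-probability ratio between the policy and a frozen reference model. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed sequence log-probabilities.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (hypothetical helper)."""
    # Implicit rewards: beta * log(pi(y|x) / pi_ref(y|x)); no reward model is fit.
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model, the margin is zero and the loss is log 2; raising the relative likelihood of the preferred completion lowers it.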
- LawrenceC 18 Aug 2023 1:17 UTC
  LW: 2 AF: 1
  I suspect the underfitting explanation is probably a lot of what’s going on given the small models used by the authors. But in the case of larger, more capable models, why would you expect the reward model to be underfitting rather than generalizing (that is, fitting properly)?
  - Thomas Kwa 18 Aug 2023 1:56 UTC
    LW: 3 AF: 1
    Maybe the reward models are expressive enough to capture all patterns in human preferences, but it seems nice to get rid of this assumption if we can. Scaling laws suggest that larger models perform better (in the Gao paper there is a gap between the 3B and 6B reward models), so it seems reasonable that even the current largest reward models are not optimal.
    I guess it hasn’t been tested whether DPO scales better than RLHF. I don’t have enough experience with these techniques to have a view on whether it does.