I wonder why Gemini used RLHF instead of Direct Preference Optimization (DPO). DPO was written up 6 months ago; it’s simpler and apparently more compute-efficient than RLHF.
Is the Gemini org structure so sclerotic that it couldn’t switch to a more efficient training algorithm partway through a project?
Is DPO inferior to RLHF in some way? Lower quality, less efficient, more sensitive to hyperparameters?
Maybe they did use DPO, even though they claimed it was RLHF in their technical report?
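For context on the "simpler" claim: this is a minimal sketch of the published DPO objective (Rafailov et al., 2023), not anything from the Gemini report. The whole training signal is a single logistic loss over preference pairs, computed from log-probabilities under the policy and a frozen reference model; there is no separately trained reward model, no PPO rollouts, and no value function. The function name and the toy numbers below are illustrative.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO loss on a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities
    (sum of token log-probs) for the preferred ("chosen") and
    dispreferred ("rejected") responses, under the policy being
    trained and under the frozen reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # One logistic loss on the margin between the two log-ratios:
    # the implicit "reward" is beta * log(pi / pi_ref).
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()


# Toy usage with made-up log-probabilities for two preference pairs.
if __name__ == "__main__":
    loss = dpo_loss(
        policy_chosen_logps=torch.tensor([-12.0, -9.5]),
        policy_rejected_logps=torch.tensor([-14.0, -9.0]),
        ref_chosen_logps=torch.tensor([-12.5, -10.0]),
        ref_rejected_logps=torch.tensor([-13.5, -9.2]),
    )
    print(loss.item())
```

That compactness is the source of the compute-efficiency argument: training looks like ordinary supervised fine-tuning over preference pairs, rather than an on-policy RL loop that has to generate rollouts and score them with a learned reward model.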