Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
TLDR; a comparison of DPO (reward-free) and PPO (reward-based) approaches to RLHF, in particular why PPO performs poorly on academic benchmarks.
An excerpt from Section 5, "Key Factors to PPO for RLHF":
We find three key techniques: (1) advantage normalization (Raffin et al., 2021), (2) large-batch-size training (Yu et al., 2022), and (3) updating the parameters of the reference model with exponential moving average (Ouyang et al., 2022).
From the ablation studies, the paper finds large-batch-size training to be significantly beneficial, especially on code generation tasks.
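For concreteness, here is a minimal PyTorch-style sketch of what (1) and (3) typically look like in a PPO-for-RLHF loop. The function names (`normalize_advantages`, `update_reference_ema`) and the decay value are illustrative assumptions, not the paper's code; (2) is a training-configuration choice (e.g. collecting more rollouts per PPO update) rather than a code change.

```python
import torch

def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # (1) Advantage normalization: rescale per-batch advantages to zero mean
    # and unit variance so the PPO surrogate loss has a stable scale.
    return (advantages - advantages.mean()) / (advantages.std() + eps)

@torch.no_grad()
def update_reference_ema(ref_model: torch.nn.Module,
                         policy_model: torch.nn.Module,
                         decay: float = 0.995) -> None:
    # (3) Exponential moving average update of the reference model used in the
    # KL penalty, so it slowly tracks the improving policy instead of staying
    # frozen at the SFT checkpoint.
    for ref_p, pol_p in zip(ref_model.parameters(), policy_model.parameters()):
        ref_p.mul_(decay).add_(pol_p, alpha=1.0 - decay)
```

In a training loop these would presumably be applied once per PPO update: normalize the advantages after computing them from the reward model, take the clipped policy-gradient step, then nudge the reference model toward the policy via the EMA.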
Might be worth following up to see how ORPO compares. (Initial results suggest it’s basically a better DPO.)
Another interesting detail is that PPO still shows superior performance on RLHF testbeds.