manually annotating the data over multiple rounds is possible (cheap)
I intended this.
This is the same as in standard RLHF. In practice, the sample efficiency of DPO may be higher or lower than that of (e.g.) PPO-based RLHF, depending on the setting.