ryan_greenblatt comments on Think carefully before calling RL policies “agents”

ryan_greenblatt 16 Dec 2023 23:57 UTC
2 points

manually annotating the data over multiple rounds is possible (cheap)

I intended this.

This is the same as normal RLHF. In practice, the sample efficiency of DPO might be higher or lower than that of (e.g.) PPO-based RLHF, depending on the case.
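
For concreteness, here is a minimal sketch of the standard per-pair DPO loss that each annotation round would optimize. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not taken from any particular library; the inputs are summed log-probabilities of the chosen and rejected completions under the current policy and under a frozen reference model.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    All names and the default beta are hypothetical, chosen for illustration.
    """
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy matches the reference on both completions, the margin is zero and the loss is log 2. It falls below that as the policy shifts probability toward the chosen completion relative to the reference. Running multiple rounds would just mean sampling new pairs from the updated policy, annotating them, and minimizing this loss again.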