Interesting. I’m thinking that by “many cases” you mean cases where either manually annotating the data over multiple rounds is possible (cheap), or cases where the model is powerful enough to label the comparison pairs itself, and we get something like a DPO version of RLAIF. That does sound more like RL.
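(To make the second case concrete, here is a rough Python sketch of what such an iterated, model-labeled DPO loop might look like. `sample_pair`, `prefers`, and `dpo_update` are hypothetical caller-supplied helpers, not anything from the post or a specific library.)

```python
def dpo_rlaif(policy, ref_policy, prompts, sample_pair, prefers, dpo_update, rounds=3):
    """Repeat: sample completions, have a model label preferences, run DPO on the pairs."""
    for _ in range(rounds):
        pairs = []
        for prompt in prompts:
            # Two candidate completions from the current policy.
            a, b = sample_pair(policy, prompt)
            # A (presumably stronger) model picks the preferred completion.
            chosen, rejected = (a, b) if prefers(prompt, a, b) else (b, a)
            pairs.append((prompt, chosen, rejected))
        # Standard offline DPO step against a frozen reference policy.
        policy = dpo_update(policy, ref_policy, pairs)
    return policy
```

Repeating the collect-label-update loop is what makes this feel “more like RL”: the comparison data tracks the current policy rather than staying a fixed offline dataset.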
“manually annotating the data over multiple rounds is possible (cheap)”
I intended this.
This is the same as normal RLHF. In practice, the sample efficiency of DPO might be higher or lower than that of (e.g.) PPO-based RLHF, depending on the case.
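(For concreteness, a minimal PyTorch-style sketch of the DPO loss on a batch of comparison pairs; the per-sequence log-probabilities under the policy and the frozen reference are assumed to be computed elsewhere. The point of the comparison is that DPO consumes offline comparison pairs directly, much like the reward-modeling stage of RLHF, rather than requiring on-policy rollouts the way PPO does.)

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of (chosen, rejected) comparison pairs.

    Each argument is a 1-D tensor of summed per-sequence log-probabilities;
    beta controls how strongly the policy is pulled away from the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the implicit reward margin; no reward model or rollouts needed.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```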