I’m not sure what you mean; in DPO you never sample from the language model. You only need the probabilities the model assigns to the preference data, so there isn’t any exploration.
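To be concrete, the training objective only touches log-probabilities of fixed preference pairs. Here is a minimal sketch of the DPO loss, assuming PyTorch; the variable names are my own, and it assumes you have already scored the chosen/rejected responses under the policy and a frozen reference model (no generation anywhere):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: log-ratio of policy vs. reference on the fixed pairs.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the reward margin toward the human-preferred response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```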
Doing multiple rounds of DPO where you sample from the LLM to get comparison pairs seems totally possible and might be the best way to use DPO in many cases.
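Roughly what I have in mind, as a sketch of one round; `policy.sample`, `label_preference`, and `dpo_update` are hypothetical stand-ins for your generation call, your annotator (human or model), and a standard DPO training step:

```python
import random

def online_dpo_round(policy, prompts, label_preference, dpo_update, n_samples=2):
    pairs = []
    for prompt in prompts:
        # The exploration happens here: comparison pairs come from the current policy.
        completions = [policy.sample(prompt) for _ in range(n_samples)]
        a, b = random.sample(completions, 2)
        chosen, rejected = label_preference(prompt, a, b)
        pairs.append((prompt, chosen, rejected))
    # Ordinary (offline) DPO update on the freshly collected pairs.
    dpo_update(policy, pairs)
    return policy
```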
You can of course use DPO on data obtained from sources other than the LLM itself.
Interesting. I’m thinking that by “many cases” you mean either cases where manually annotating the data over multiple rounds is possible (cheap), or cases where the model is powerful enough to label the comparison pairs itself, and we get something like a DPO version of RLAIF. That does sound more like RL.
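For the RLAIF-style case, the labeler in a loop like the one sketched above would itself be a model. Something like this, where the `judge.sample` call and the prompt format are made up purely for illustration:

```python
def ai_label_preference(judge, prompt, a, b):
    # A judge model picks the preferred completion instead of a human annotator.
    question = (
        f"Prompt: {prompt}\n\nResponse A: {a}\n\nResponse B: {b}\n\n"
        "Which response is better? Answer with exactly 'A' or 'B'."
    )
    verdict = judge.sample(question).strip().upper()
    return (a, b) if verdict.startswith("A") else (b, a)  # (chosen, rejected)
```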
“manually annotating the data over multiple rounds is possible (cheap)”
I intended this.
This is the same as normal RLHF. In practice the sample efficiency of DPO might be higher or lower than that of (e.g.) PPO-based RLHF, depending on the case.
Depending on the sampling process you use, I think you should consider this the same as RL.