I’d be interested in seeing this too. I think it sounds like it might work when the probabilities are similar (e.g. 0.8 and 0.2) but I would expect weird things to happen if it was like 1000x more confident in one answer.
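For concreteness, here’s a rough sketch of the mechanics (my own PyTorch illustration, not anyone’s actual setup; the 0.8/0.2 and 0.999/0.001 targets are made up). Soft-label training is just cross-entropy against the full target distribution, and the logit gap the optimum demands grows fast as the target gets extreme:

```python
import math

import torch
import torch.nn.functional as F

def soft_label_loss(logits: torch.Tensor, target_probs: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a target distribution rather than a hard label."""
    return -(target_probs * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

logits = torch.zeros(1, 2, requires_grad=True)  # model starts out at 50/50

for p in (0.8, 0.999):
    target = torch.tensor([[p, 1.0 - p]])
    (grad,) = torch.autograd.grad(soft_label_loss(logits, target), logits)
    # The optimum is a logit gap of log(p / (1 - p)): ~1.4 nats for 0.8/0.2
    # but ~6.9 nats for 0.999/0.001, so near extreme targets, tiny logit
    # errors swing the implied odds ratio by orders of magnitude.
    print(f"target={p:.3f}  grad={grad.tolist()}  optimal_gap={math.log(p / (1 - p)):.2f}")
```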
(Relatedly, I’m pretty unsure whether “just train multiple times” is the right way to do this, and whether people have thought about approaches that seem less janky.)
> Relatedly, I’m pretty unsure whether “just train multiple times” is the right way to do this, and whether people have thought about approaches that seem less janky
I think DPO on contrast pairs seems like a pretty natural approach.
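For reference, the DPO loss on a contrast pair is a logistic loss on the gap in implicit rewards between the chosen and rejected completions, following Rafailov et al. (2023). A minimal sketch, where the sequence log-probs are placeholder tensors (in practice they’d be summed token log-probs under the policy and a frozen reference model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward of each completion: beta * (log pi_theta - log pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the chosen completion's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Placeholder numbers: the policy currently slightly prefers the rejected side.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-11.0]),
                torch.tensor([-11.5]), torch.tensor([-11.5]))
print(loss.item())
```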
I think “just train multiple times” or “multiply the update so that it’s as if you trained it repeatedly” are probably fine.
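One caveat worth noting: scaling the loss by k only matches k repeated steps to first order, since sequential steps each see the updated parameters. A toy sketch under plain SGD (my own example, with a made-up quadratic loss):

```python
import torch

def loss_fn(w):
    return (w - 3.0) ** 2  # toy objective with its minimum at w = 3

lr, k = 0.1, 4

# (a) One step on a k-scaled loss ("multiply the update").
w = torch.tensor(1.0, requires_grad=True)
(k * loss_fn(w)).backward()
with torch.no_grad():
    w -= lr * w.grad
print("scaled update:   ", w.item())  # 2.6

# (b) k sequential steps on the unscaled loss ("just train multiple times").
w = torch.tensor(1.0, requires_grad=True)
for _ in range(k):
    w.grad = None          # clear the accumulated gradient between steps
    loss_fn(w).backward()
    with torch.no_grad():
        w -= lr * w.grad
print("k repeated steps:", w.item())  # ~2.18: close, but not identical
```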