Relatedly, I’m pretty confused about whether “just train multiple times” is the right way to do this, and whether people have thought about ways to do this that don’t seem as janky.
I think DPO on contrast pairs seems like a pretty natural approach.
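To make that concrete, here is a minimal sketch of the per-pair DPO loss. It assumes a "contrast pair" means a preferred and a dispreferred completion for the same prompt, and that the summed token log-probabilities under the policy and a frozen reference model are already computed. The function name, argument names, and the default `beta=0.1` are placeholders for illustration, not anything from the thread.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (hypothetical helper; inputs are sequence log-probs)."""
    # How much more the policy likes each completion than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled margin between the preferred and dispreferred completions.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference the margin is zero and the loss is log 2; it drops as the policy shifts probability toward the preferred completion relative to the reference. In a real setup this would be averaged over a batch and backpropagated through the policy log-probs.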