I’d be interested in seeing this too. I think it sounds like it might work when the probabilities are similar (e.g. 0.8 and 0.2) but I would expect weird things to happen if it was like 1000x more confident in one answer.
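For concreteness, here’s a rough sketch of the mechanics (my own PyTorch illustration, not anyone’s actual setup; the 0.8/0.2 and 0.999/0.001 targets are made up). Soft-label training is just cross-entropy against the full target distribution, and the logit gap the optimum demands grows fast as the target gets extreme:

```python
import math

import torch
import torch.nn.functional as F

def soft_label_loss(logits: torch.Tensor, target_probs: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a target distribution rather than a hard label."""
    return -(target_probs * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

logits = torch.zeros(1, 2, requires_grad=True)  # model starts out at 50/50

for p in (0.8, 0.999):
    target = torch.tensor([[p, 1.0 - p]])
    (grad,) = torch.autograd.grad(soft_label_loss(logits, target), logits)
    # The optimum is a logit gap of log(p / (1 - p)): ~1.4 nats for 0.8/0.2
    # but ~6.9 nats for 0.999/0.001, so near extreme targets, tiny logit
    # errors swing the implied odds ratio by orders of magnitude.
    print(f"target={p:.3f}  grad={grad.tolist()}  optimal_gap={math.log(p / (1 - p)):.2f}")
```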
(Relatedly, I’m pretty unsure whether “just train multiple times” is the right way to do this, and whether people have thought about approaches that seem less janky.)
> Relatedly, I’m pretty unsure whether “just train multiple times” is the right way to do this, and whether people have thought about approaches that seem less janky
I think DPO on contrast pairs seems like a pretty natural approach.
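For reference, the DPO loss on a contrast pair is a logistic loss on the gap in implicit rewards between the chosen and rejected completions, following Rafailov et al. (2023). A minimal sketch, where the sequence log-probs are placeholder tensors (in practice they’d be summed token log-probs under the policy and a frozen reference model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward of each completion: beta * (log pi_theta - log pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the chosen completion's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Placeholder numbers: the policy currently slightly prefers the rejected side.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-11.0]),
                torch.tensor([-11.5]), torch.tensor([-11.5]))
print(loss.item())
```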
I think “just train multiple times” or “multiply the update so that it’s as if you trained it repeatedly” are probably fine.
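One caveat worth noting: scaling the loss by k only matches k repeated steps to first order, since sequential steps each see the updated parameters. A toy sketch under plain SGD (my own example, with a made-up quadratic loss):

```python
import torch

def loss_fn(w):
    return (w - 3.0) ** 2  # toy objective with its minimum at w = 3

lr, k = 0.1, 4

# (a) One step on a k-scaled loss ("multiply the update").
w = torch.tensor(1.0, requires_grad=True)
(k * loss_fn(w)).backward()
with torch.no_grad():
    w -= lr * w.grad
print("scaled update:   ", w.item())  # 2.6

# (b) k sequential steps on the unscaled loss ("just train multiple times").
w = torch.tensor(1.0, requires_grad=True)
for _ in range(k):
    w.grad = None          # clear the accumulated gradient between steps
    loss_fn(w).backward()
    with torch.no_grad():
        w -= lr * w.grad
print("k repeated steps:", w.item())  # ~2.18: close, but not identical
```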