Suppose we finetune the model to maximize the probability placed on answer A. If we train to convergence, its sampling probabilities end up assigning ~1 to A and ~0 to B. At that point there is no more signal that naive finetuning can extract from this data.
As you note, one difference between supervised fine-tuning (SFT) and CAA is that when producing a steering vector, CAA places equal weight on every completion, while SFT doesn’t (see here for the derivative of log softmax, which I had to look up :) ).
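Writing that derivative out for concreteness: with logits $z$ and $p = \operatorname{softmax}(z)$, the negative log-likelihood of target token $i$ has gradient
$$\frac{\partial}{\partial z_j}\bigl(-\log p_i\bigr) = p_j - \mathbf{1}[j = i],$$
so the gradient on the target logit is $p_i - 1$: a completion the model already assigns high probability contributes almost nothing to the update, which is exactly the sense in which SFT does not place equal weight on every completion.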
I’m interested in what happens if you try SFT on all these problems with negative-log-likelihood loss, but you reweight the loss of different completions so that it’s as if every completion you train on was equally likely before training. In your example, if you had probability 0.8 on A and 0.2 on B, I unconfidently think that the correct reweighting is to weigh the B completion 4x as high as the A completion, because B was initially less likely.
I think it’s plausible that some/most of the improvements you see with your method would be captured by this modification to SFT.
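Here is a minimal PyTorch sketch of the reweighting described above, just to make it concrete (an illustration, not anyone's actual implementation; `p_initial` is assumed to hold each completion's probability under the model before any finetuning, and the weight is taken to be proportional to 1 / p_initial):

```python
import torch.nn.functional as F

def reweighted_nll_loss(logits, target_ids, p_initial):
    """NLL over a batch of completions, with each completion's loss scaled by
    the inverse of its probability under the *pre-finetuning* model, so that
    initially unlikely completions count for proportionally more (4x for the
    0.2 answer vs. the 0.8 answer in the example above).

    logits:     (batch, seq_len, vocab) from the model being trained
    target_ids: (batch, seq_len) completion token ids, -100 on ignored positions
    p_initial:  (batch,) probability the untuned model assigned to each completion
    """
    # Per-token NLL, summed over each completion (ignored positions contribute 0).
    per_token = F.cross_entropy(
        logits.transpose(1, 2), target_ids, reduction="none", ignore_index=-100
    )  # (batch, seq_len)
    per_completion = per_token.sum(dim=1)  # (batch,)

    # Weight each completion as if it had been equally likely before training:
    # weight proportional to 1 / p_initial, normalized so the mean weight is 1.
    weights = 1.0 / p_initial
    weights = weights * (len(weights) / weights.sum())

    return (weights * per_completion).mean()
```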
I’d be interested in seeing this too. I think it sounds like it might work when the probabilities are similar (e.g. 0.8 and 0.2), but I would expect weird things to happen if the model were, say, 1000x more confident in one answer (under the reweighting above, the rare completion’s loss would get a ~1000x weight, which is effectively a huge learning rate on that one example).
(Relatedly, I’m pretty confused about whether “just train multiple times” is the right way to do this, and about whether people have thought about ways to do this that don’t seem as janky.)
Relatedly, I’m pretty confused about whether “just train multiple times” is the right way to do this, and about whether people have thought about ways to do this that don’t seem as janky
I think DPO on contrast pairs seems like a pretty natural approach.
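For concreteness, here is a minimal sketch of the standard DPO loss on a single contrast pair (the summed log-probabilities and `beta` are inputs assumed by this illustration, not anything from the post):

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective (Rafailov et al., 2023) on a contrast pair.

    Each argument is the summed log-probability of the chosen / rejected
    completion under the policy being trained or the frozen reference model.
    """
    # How much the policy has moved away from the reference on each completion.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected

    # -log sigmoid(beta * (chosen margin - rejected margin))
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```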
I think “just train multiple times” or “multiply the update so that it’s as if you trained it repeatedly” are probably fine.
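For what it’s worth, the two are the same to first order in the learning rate: after $k$ small steps, $\theta_k = \theta_0 - \eta \sum_{j < k} \nabla L(\theta_j) = \theta_0 - \eta k \nabla L(\theta_0) + O(\eta^2)$, which matches a single update with the loss (or gradient) scaled by $k$; they only come apart once second-order effects matter.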