Hey Christopher, this is really cool work. I think your idea of representation tuning is a very nice way to combine activation steering and fine-tuning. Do you have any intuition as to why fine-tuning towards the steering vector sometimes works better than simply steering towards it?
If you keep working on this I’d be interested to see a more thorough evaluation of capabilities (more than just perplexity) by running it on some standard LM benchmarks. Whether the model retains its capabilities seems important for understanding the safety-capabilities trade-off of this method.
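(For concreteness, something like EleutherAI’s lm-evaluation-harness would cover this; a rough sketch, with a placeholder checkpoint path:)

```python
# Rough sketch using lm-evaluation-harness (pip install lm-eval);
# the checkpoint path and task list are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/honesty-tuned-model",
    tasks=["hellaswag", "arc_challenge", "mmlu"],
    batch_size=8,
)
print(results["results"])  # per-task metrics
```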
I’m curious whether you tried building some way of retaining general capabilities into the loss function with which you do representation tuning? E.g. regularising the activations to stay closer to the original activations, or adding a standard language-modelling loss?
As a nitpick: I think when measuring the robustness of the tuned models, the comparison advantages the honesty-tuned model. If I understand correctly, the honesty-tuned model was specifically trained to be less like the vector used for dishonesty steering, whereas the truth-tuned model wasn’t. A fairer comparison might use automatic adversarial attack methods like GCG.
Again, I think this is a very cool project!
Hi Jan, thanks for the feedback! I suspect that fine-tuning had a stronger impact on output than steering in this case partly because it was easier to find an optimal value for the amount of tuning than for the steering strength, and partly because the tuning is present for every token; note in Figure 2C how the dishonesty direction is first “activated” a few tokens before generation. It would be interesting to look at exactly how the weights were changed and see if any insights can be gleaned from that.
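To make the contrast concrete, here’s a minimal sketch (placeholder model, layer index, vector, and strength, not my actual code): steering injects a fixed vector into one layer’s residual stream via a hook only at inference time, whereas tuning bakes the shift into the weights, so it’s present for every token with no runtime intervention.

```python
# Minimal sketch of the steering-vs-tuning contrast (placeholder model,
# layer index, vector, and strength -- not the code from the post).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

layer_idx = 6                                   # hypothetical intervention layer
steer = torch.randn(model.config.hidden_size)   # stand-in for a real honesty vector
alpha = 4.0                                     # steering strength; hard to get right

def add_steering(module, inputs, output):
    # output[0] holds the residual-stream hidden states: (batch, seq, hidden)
    return (output[0] + alpha * steer / steer.norm(),) + output[1:]

# Steering: the vector is injected at inference, only while the hook is live.
handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
ids = tokenizer("The capital of France is", return_tensors="pt")
print(tokenizer.decode(model.generate(**ids, max_new_tokens=10)[0]))
handle.remove()  # a tuned model needs no hook: the shift lives in the weights
```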
I definitely agree about the need for more robust capability evaluations. This approach seems to me to have real safety potential, but proving that will require more analysis; it’ll just take some time to do.
Regarding adding a way to retain general capabilities, that was actually my original idea; I had a dual loss, with the other term being a standard token-based loss. But it turned out to be difficult to get right and not necessary in this case. After writing this up, I was alerted to the Zou et al. Circuit Breakers paper, which does something similar but more sophisticated; I might try to adapt their approach.
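For what it’s worth, the dual loss looked roughly like this sketch (illustrative names and a hypothetical layer index, not my exact code): a cosine-similarity term shaping the target layer’s activations toward the honesty direction, plus the standard token-based cross-entropy to anchor capabilities.

```python
# Rough sketch of the dual loss (illustrative names, hypothetical layer index):
# shape one layer's activations toward a target direction while a standard
# language-modelling loss anchors general capabilities.
import torch
import torch.nn.functional as F

def dual_loss(model, input_ids, target_vec, layer_idx=6, beta=1.0):
    out = model(input_ids, labels=input_ids, output_hidden_states=True)
    acts = out.hidden_states[layer_idx]               # (batch, seq, hidden)
    # Representation term: maximise cosine similarity between each token's
    # activation and the (unit-norm) honesty direction.
    cos = F.cosine_similarity(acts, target_vec.view(1, 1, -1), dim=-1)
    repr_loss = -cos.mean()
    lm_loss = out.loss                                # token-based cross-entropy
    return repr_loss + beta * lm_loss                 # beta trades the two off
```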
Finally, the truth-/lie-tuned models followed an existing approach in the literature to which I was offering an alternative, so a head-to-head comparison seemed fair; both approaches produce honest/dishonest models, it just seems that the representation-tuned one is more robust to steering. TBH I’m not familiar with GCG, but I’ll check it out. Thanks for pointing it out.