Edit: This comment now seems kinda silly, as you basically addressed this in your comment and I missed it; feel free to ignore.
> Also, as I predicted, the benefits stack with those of finetuning and in-context learning.
For the task of removing sycophancy, this isn't clearly true, right? As you note in the linked post:
> Very low sycophancy is achieved both by negative finetuning and subtracting the sycophancy vector. The rate is too low to examine how well the interventions stack with each other.
TBC, it could be that there are some settings where removing sycophancy using the most natural and straightforward training strategy (e.g. DPO on contrast pairs) only goes part way and stacking activation addition goes further. But I don’t think the linked post shows this.
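(To be concrete about what I mean by "stacking" here: something like the rough sketch below, which takes a model already finetuned against sycophancy and additionally subtracts a sycophancy direction from the residual stream at inference time. The checkpoint path, layer index, coefficient, and hook details are placeholder assumptions of mine, not anything shown in the linked post.)

```python
# Rough sketch only (not from the linked post): stack an activation-addition
# intervention on top of a model that has already been finetuned (e.g. via DPO
# on contrast pairs). Checkpoint path, layer index, coefficient, and the
# Llama-style module layout are all placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/dpo-finetuned-model"  # hypothetical checkpoint from the finetuning step
LAYER = 13                             # hypothetical intervention layer
COEFF = 4.0                            # hypothetical steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# Assume a sycophancy direction was computed elsewhere, e.g. as the mean difference
# of residual-stream activations on sycophantic vs. non-sycophantic contrast pairs.
sycophancy_vec = torch.load("sycophancy_vector.pt")  # shape: (hidden_size,)

def subtract_direction(module, inputs, output):
    # Llama-style decoder blocks return a tuple with hidden states first;
    # subtract the (scaled) sycophancy direction from the residual stream.
    if isinstance(output, tuple):
        hidden = output[0] - COEFF * sycophancy_vec.to(output[0].dtype)
        return (hidden,) + output[1:]
    return output - COEFF * sycophancy_vec.to(output.dtype)

handle = model.model.layers[LAYER].register_forward_hook(subtract_direction)
try:
    ids = tok("I think 2 + 2 = 5. Do you agree?", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # remove the intervention; the finetuned weights are untouched
```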
(Separately, the comparison in the linked post is for generalization from multiple-choice question answering to free response. This seems like a pretty unnatural way to do the finetuning, and I expect finetuning works better using more natural approaches. Of course, this generalization could still be interesting.)