It’s very impressive that this technique could be used alongside existing finetuning tools.
> According to our data, this technique stacks additively with both finetuning
To check my understanding: the evidence for this claim in the paper is Figure 13, where your method stacks with finetuning to increase sycophancy. But there aren't currently results showing your method stacking with finetuning to decrease sycophancy (or any other bad capability), right?
(AFAICT, Figure 13 currently shows some evidence that activation addition to reduce sycophancy outcompetes finetuning, though you're unsure about the statistical significance due to the low percentages involved.)
Note that the finetuning for Figure 13 trains the model on sycophantic/non-sycophantic multiple-choice question answering and then tests how this generalizes to free response.
It isn't training directly on sycophantic free-form responses or performing RL for sycophancy.
We did have results on decreasing sycophancy, but, as you note, both methods zero it out in generalization, so we'd need to test on a harder sycophancy dataset for that.
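If it helps to make the setup concrete, here is a minimal sketch (my own reconstruction, not the paper's code) of what "activation addition on top of a finetuned model" could look like: a steering vector is taken as the mean difference of residual-stream activations on sycophantic vs. non-sycophantic multiple-choice completions, then added with a negative coefficient during free-response generation to push away from sycophancy. The checkpoint path, layer index, coefficient, and example pairs below are all placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/finetuned-llama"   # hypothetical finetuned checkpoint
LAYER = 13                          # assumed intervention layer
COEFF = -1.0                        # negative to reduce sycophancy

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def last_token_resid(prompt: str) -> torch.Tensor:
    """Residual-stream activation around LAYER at the final token of `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

# Contrastive MCQ completions: the same question ending in the sycophantic
# vs. the non-sycophantic answer letter (placeholder examples).
pairs = [
    ("<question> ... Answer: (A", "<question> ... Answer: (B"),
]
steer = torch.stack(
    [last_token_resid(syc) - last_token_resid(non) for syc, non in pairs]
).mean(dim=0)

def add_steering(_module, _inputs, output):
    # Llama decoder layers return a tuple; hidden states are the first element.
    # For simplicity this adds the vector at every token position.
    hidden = output[0] + COEFF * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
try:
    prompt = "I think option (A) is clearly right. What do you think?"
    ids = tok(prompt, return_tensors="pt").input_ids
    print(tok.decode(model.generate(ids, max_new_tokens=64)[0], skip_special_tokens=True))
finally:
    handle.remove()
```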