It’s very impressive that this technique could be used alongside existing finetuning tools.
> According to our data, this technique stacks additively with both finetuning
To check my understanding: the evidence for this claim in the paper is Figure 13, where your method stacks with finetuning to increase sycophancy. But there aren't currently results showing your method stacking with finetuning to decrease sycophancy (or any other bad capability), right?
(AFAICT, Figure 13 currently shows some evidence that activation addition to reduce sycophancy outcompetes finetuning, though you're unsure about the statistical significance due to the low percentages involved.)
Note that the finetuning for Figure 13 trains the model on sycophantic/non-sycophantic multiple-choice question answering and then tests how this generalizes to free response.
It isn't training directly on sycophantic free-form responses or performing RL for sycophancy.
We did have results on decreasing sycophancy, but, as you note, both methods zero it out in generalization, so we'd need to test on a harder sycophancy dataset for that.
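If it helps to make the setup concrete, here is a minimal sketch (my own reconstruction, not the paper's code) of what "activation addition on top of a finetuned model" could look like: a steering vector is taken as the mean difference of residual-stream activations on sycophantic vs. non-sycophantic multiple-choice completions, then added with a negative coefficient during free-response generation to push away from sycophancy. The checkpoint path, layer index, coefficient, and example pairs below are all placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/finetuned-llama"   # hypothetical finetuned checkpoint
LAYER = 13                          # assumed intervention layer
COEFF = -1.0                        # negative to reduce sycophancy

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def last_token_resid(prompt: str) -> torch.Tensor:
    """Residual-stream activation around LAYER at the final token of `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

# Contrastive MCQ completions: the same question ending in the sycophantic
# vs. the non-sycophantic answer letter (placeholder examples).
pairs = [
    ("<question> ... Answer: (A", "<question> ... Answer: (B"),
]
steer = torch.stack(
    [last_token_resid(syc) - last_token_resid(non) for syc, non in pairs]
).mean(dim=0)

def add_steering(_module, _inputs, output):
    # Llama decoder layers return a tuple; hidden states are the first element.
    # For simplicity this adds the vector at every token position.
    hidden = output[0] + COEFF * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
try:
    prompt = "I think option (A) is clearly right. What do you think?"
    ids = tok(prompt, return_tensors="pt").input_ids
    print(tok.decode(model.generate(ids, max_new_tokens=64)[0], skip_special_tokens=True))
finally:
    handle.remove()
```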