I think the answer turns out to be: “No, the sample efficiency and generalization are better than normal training.” See our recent post, Steering Llama-2 with contrastive activation additions.
Activation additions generalize better than in-context learning and supervised finetuning for increasing sycophancy, and at least as well for decreasing sycophancy. Sample efficiency is also better, both in our Llama-2 data and in other work on activation engineering. Also, as I predicted, the benefits stack with those of finetuning and in-context learning.
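(For readers unfamiliar with the method, here is a rough sketch of the contrastive-activation-addition idea: average the activation difference across contrast pairs at one layer, then add that vector back in at inference time. It assumes a HuggingFace-style Llama-2 interface; the helper names, layer choice, and multiplier are illustrative, not the exact code from the post.)

```python
import torch

def contrastive_activation_vector(model, tokenizer, pairs, layer_idx):
    """Mean difference in residual-stream activations (final token,
    chosen layer) between the two completions of each contrast pair."""
    diffs = []
    for pos_text, neg_text in pairs:
        acts = []
        for text in (pos_text, neg_text):
            ids = tokenizer(text, return_tensors="pt").input_ids
            with torch.no_grad():
                out = model(ids, output_hidden_states=True)
            acts.append(out.hidden_states[layer_idx][0, -1])
        diffs.append(acts[0] - acts[1])
    return torch.stack(diffs).mean(dim=0)

def add_steering_hook(model, vector, layer_idx, multiplier):
    """Add (multiplier > 0) or subtract (multiplier < 0) the steering
    vector at the chosen layer on every forward pass."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden += multiplier * vector  # in-place edit of the residual stream
    return model.model.layers[layer_idx].register_forward_hook(hook)
```

Since the hook only edits activations at inference time, it can be registered on top of a finetuned model or combined with few-shot prompts, which is the "stacking" under discussion; a negative multiplier gives the anti-sycophancy direction.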
On sample efficiency and generalization more broadly, I now overall think something like:
Using contrast pairs for variance reduction is a useful technique for improving sample efficiency; a toy sketch of why appears after this list. (And I was foolish not to have understood that this was part of the method in this post.)
I’m unsure of the best way to use contrast pairs to maximize sample efficiency. It seems plausible to me that it will be something like activation addition, but I could also imagine DPO or some variant of DPO working better in practice. It would be interesting to see further work comparing these methods and also doing ablations to understand where the advantages of the best methods come from. (That said, while this is interesting, I’m unsure how important improving sample efficiency is from an AI x-safety perspective.)
I don’t think any of the generalization results I see in the linked post are very interesting, as the generalizations don’t feel importantly analogous to the generalizations I care about. (I think all the results are on generalization from multiple-choice question answering to free response?) I’d be more excited about generalization work targeting settings where oversight is difficult and a weak supervisor would make mistakes that result in worse policy behavior (aka weak-to-strong generalization). See this post for more discussion of the setting I’m thinking about.
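(The toy simulation promised above, illustrating the variance-reduction point; this is not the post’s actual data, and all numbers are made up. If the two elements of a pair share a prompt-specific offset, differencing within pairs cancels that offset, so the estimated direction converges with far fewer samples.)

```python
import torch

torch.manual_seed(0)
n, d = 64, 16
direction = torch.randn(d)  # the "true" behavioral direction

# Toy activations: each contrast pair shares a prompt-specific offset
# (the correlated nuisance term) and differs along `direction`
# plus small independent noise.
shared = 3.0 * torch.randn(n, d)
pos = shared + direction + 0.1 * torch.randn(n, d)
neg = shared - direction + 0.1 * torch.randn(n, d)

# Unpaired baseline: negatives come from a different batch of prompts,
# so the prompt offsets no longer cancel.
neg_unpaired = 3.0 * torch.randn(n, d) - direction + 0.1 * torch.randn(n, d)

paired_est = (pos - neg).mean(dim=0) / 2
unpaired_est = (pos.mean(dim=0) - neg_unpaired.mean(dim=0)) / 2

print("paired error:  ", (paired_est - direction).norm().item())
print("unpaired error:", (unpaired_est - direction).norm().item())
```

Running this, the paired estimate’s error is roughly an order of magnitude smaller at the same sample count, which is the sense in which pairing buys sample efficiency.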
Due to the results from Liu et al. noted in TurnTrout’s comment here, I now don’t think the action is mostly coming from contrast pairs (in at least some cases).
So, there is higher sample efficiency for activation engineering stuff over LoRA finetuning in some cases.[1]
(Though it feels to me like there should be some more principled SGD-style method which captures the juice.)
[1] Up to methodological error in learning rates, etc.
I think the answer turns out to be: “No, the sample efficiency and generalization are better than normal training.”

From my understanding of your results, this isn’t true for removing sycophancy, the original task I was talking about? My core claim was that removing blatant sycophancy, like in this Anthropic dataset, is pretty easy in practice.
Edit: This comment now seems kinda silly as you basically addressed this in your comment and I missed it, feel free to ignore.
Also, as I predicted, the benefits stack with those of finetuning and in-context learning.

For the task of removing sycophancy this isn’t clearly true, right? As you note in the linked post:

Very low sycophancy is achieved both by negative finetuning and subtracting the sycophancy vector. The rate is too low to examine how well the interventions stack with each other.
TBC, it could be that there are some settings where removing sycophancy using the most natural and straightforward training strategy (e.g. DPO on contrast pairs) only goes part way and stacking activation addition goes further. But I don’t think the linked post shows this.
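(For concreteness, by “DPO on contrast pairs” I have in mind the standard DPO objective from Rafailov et al. applied with the non-sycophantic completion as the preferred response. A minimal sketch, with the per-completion log-prob computation left out and all names illustrative:)

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss over a batch of contrast pairs: `w` is the
    preferred (non-sycophantic) completion, `l` the sycophantic one.
    Each argument is the summed log-probability of that completion
    under the policy or the frozen reference model."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```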
(Separately, the comparison in the linked post is on generalization from multiple-choice question answering to free response. This seems like a pretty unnatural way to do the finetuning, and I expect finetuning works better using more natural approaches. Of course, this generalization could still be interesting.)