FWIW, I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful), which, from what I can tell, is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF'd LLMs.
I think this post is novel compared to both my work and RepE, because the authors:
Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering
Investigate projection thoroughly as an alternative to sweeping over vector magnitudes (rather than just stating that this is possible)
Find that using harmful/harmless instructions (rather than harmful vs. harmless/refusal responses) to generate a contrast vector is the most effective (whereas other works try one or the other), and also investigate which token position to extract the representation from
Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer, which is different from standard activation steering
Test on many different models
Describe a way of turning this into a weight-edit
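For concreteness, the recipe summarized in the list above (a difference-in-means contrast vector from harmful vs. harmless instructions, then projecting that direction out of the residual stream at every layer) can be sketched roughly as follows. This is a minimal NumPy sketch of the general technique, not code from the post; the function names and synthetic shapes are my own:

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Difference-in-means contrast vector between residual-stream
    activations on harmful vs. harmless instructions, unit-normalized."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(X, r_hat):
    """Project the r_hat component out of each activation row of X,
    so the residual stream no longer carries that feature. In practice
    this would be applied at every layer, not just one."""
    return X - np.outer(X @ r_hat, r_hat)

# Toy demonstration on synthetic activations (d_model = 16).
rng = np.random.default_rng(0)
harmful = rng.normal(size=(32, 16)) + 3.0 * np.eye(16)[0]  # shifted along dim 0
harmless = rng.normal(size=(32, 16))
r_hat = refusal_direction(harmful, harmless)
acts = rng.normal(size=(8, 16))
ablated = ablate_direction(acts, r_hat)
```

Unlike adding a steering vector with a tuned magnitude, the projection needs no coefficient sweep: after ablation, every activation has exactly zero component along the extracted direction.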
Edit:
(Want to flag that I strong-disagree-voted with your comment, and am not in the research group—it is not them “dogpiling”)
I do agree that RepE should be included in a “related work” section of a paper but generally people should be free to post research updates on LW/AF that don’t have a complete thorough lit review / related work section. There are really very many activation-steering-esque papers/blogposts now, including refusal-bypassing-related ones, that all came out around the same time.
but generally people should be free to post research updates on LW/AF that don’t have a complete thorough lit review / related work section.
I agree, if they simultaneously accept that they shouldn't expect the post to be cited. Such posts can't posture as academic artifacts (a “Citing this work” section indicates that's the expectation) while failing to mention related work. I don't think you should expect people to treat your post as related work if you don't cover related work yourself.
Otherwise there's a race to the bottom, and it makes sense to post daily research notes and flag-plant that way, which increases the pressure on researchers further.
including refusal-bypassing-related ones
The prior work covered in the document is generally less related (fine-tuning to remove safeguards, truth directions) than these directly relevant works. This is an unusual citation pattern, and it gives the impression that the artifact makes more progress / advances understanding more than it actually does.
I'll note that pretty much every time I mention that something on LW isn't following academic standards, I get ganged up on, which I find pretty weird. I've reviewed for and organized ML conferences and could serve as a senior area chair, so I know the standards well. Perhaps this reaction is so consistent because academic standards feel like an outside community imposing its norms on LW.
We do weight editing in the RepE paper (that’s why it’s called RepE instead of ActE)
I looked at the paper again and couldn't find anywhere you do the type of weight editing this post describes (extracting a representation and then changing the weights, without optimization, so that they can no longer write to that direction).
The LoRRA approach mentioned in RepE fine-tunes the model to change its representations, which is different.
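The optimization-free weight edit being described (orthogonalizing every matrix that writes to the residual stream against the extracted direction) can be sketched as a rank-one update. This is my illustration of the general idea, not code from either work; it assumes W maps into the residual stream, i.e. its output lives in model space:

```python
import numpy as np

def orthogonalize_weights(W, r_hat):
    """Rank-one, optimization-free edit of an output weight matrix W
    (shape d_model x d_in, writing into the residual stream):
    W' = (I - r_hat r_hat^T) W, so W' can never write along r_hat."""
    return W - np.outer(r_hat, r_hat) @ W

# Toy check: after the edit, the layer's output has no r_hat component.
rng = np.random.default_rng(1)
r_hat = rng.normal(size=8)
r_hat = r_hat / np.linalg.norm(r_hat)
W = rng.normal(size=(8, 4))
W_edited = orthogonalize_weights(W, r_hat)
out = W_edited @ rng.normal(size=4)
```

Because the edit is a closed-form projection of the weights themselves, it bakes the ablation into the model with no fine-tuning, which is what distinguishes it from a LoRRA-style optimization approach.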
I agree you investigate much of what I mentioned somewhere in the paper, but did you do this for refusal removal in particular? I spent some time on this problem before and noticed that full refusal ablation is hard unless you get the technique/vector right, even though it's easy to reduce refusal or to add a bunch of extra refusal. That's why investigating all the technique parameters in the context of refusal specifically is valuable.
This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405
In our paper and notebook we show the models are coherent.
We did investigate projection too (we use it for concept removal in the RepE paper) but didn’t find a substantial benefit for jailbreaking.
We use harmful/harmless instructions.
In the RepE paper we target multiple layers as well.
The paper used Vicuna, and the notebook used Llama 2; throughout the paper we showed that the general approach works on many different models.