Dan H comments on Refusal in LLMs is mediated by a single direction

Dan H 27 Apr 2024 21:14 UTC
1 point
−21
AF
From Andy Zou:

Thank you for your reply.

“Model interventions to bypass refusal are not discussed in Section 6.2.”

We perform model interventions to robustify refusal (your section on “Adding in the ‘refusal direction’ to induce refusal”). Bypassing refusal, which we do in the GitHub demo, is merely adding a negative sign to the direction. Either of these experiments shows that refusal can be mediated by a single direction, in keeping with the title of this post.

“we examined Section 6.2 carefully before writing our work”

Not mentioning it anywhere in your work is highly unusual given its extreme similarity. Knowingly failing to cite what are probably the most closely related experiments is generally considered plagiarism or citation misconduct, though this is a blog post, so norms for thoroughness are weaker. (lightly edited by Dan for clarity)

Ablating vs. Addition

We perform a linear combination operation on the representation. Projecting out the direction is one instantiation of it with a particular coefficient, which is not necessary as shown by our GitHub demo. (Dan: we experimented with projection in the RepE paper and didn’t find it was worth the complication. We look forward to any results suggesting a strong improvement.)
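To make the relationship between the two operations concrete, here is a minimal NumPy sketch. It is not taken from the RepE paper, the GitHub demo, or the post; the function names and the random vectors are hypothetical stand-ins for a residual-stream activation and an extracted direction. It shows only the algebra: adding the direction with a chosen coefficient, and projecting it out, which is the same addition with the coefficient set to minus the activation's component along the direction.

```python
import numpy as np

def add_direction(x, r, coeff):
    """Linear combination: shift activation x by coeff along the unit direction of r."""
    r_hat = r / np.linalg.norm(r)
    return x + coeff * r_hat

def project_out(x, r):
    """Directional ablation: remove the component of x that lies along r."""
    r_hat = r / np.linalg.norm(r)
    return x - np.dot(x, r_hat) * r_hat

# Hypothetical stand-ins for an activation and a "refusal direction".
rng = np.random.default_rng(0)
x = rng.normal(size=8)
r = rng.normal(size=8)
r_hat = r / np.linalg.norm(r)

# Projection is addition with the coefficient -(x . r_hat).
coeff = -np.dot(x, r_hat)
assert np.allclose(add_direction(x, r, coeff), project_out(x, r))

# After projection, nothing of x remains along the direction.
assert np.isclose(np.dot(project_out(x, r), r_hat), 0.0)
```

In this sketch the projection coefficient is recomputed for each activation, while `add_direction` with a constant negative `coeff` applies the same shift to every input.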

--

Please reach out to Andy if you want to talk more about this.

Edit: The work is prior art (it has been available for over six months in a standard, accessible format), the PIs are aware of the work (the PI of this work spoke about it with Dan months ago, and the lead author spoke with Andy about the paper months ago), and its relative similarity is probably higher than that of any other artifact. When this is on arXiv, we’re asking you to cite the related work and acknowledge its similarities rather than acting as if the two have little to do with each other or not mentioning it. Some people retaliating by dogpile voting or ganging up on this comment to bury sloppy behavior or an embarrassing oversight is not the right response (it went to −18 very quickly).

Edit 2: On X, Neel “agree[s] it’s highly relevant” and that he’ll cite it. Assuming it’s covered fairly and reasonably, this resolves the situation.

Edit 3: I think not citing it isn’t a big deal because I think of LW as a place for ML research rough drafts, in which errors will happen. But if some think it’s at the level of an academic artifact, is citable content, or carries an expectation that others cite it going forward, then failing to mention extremely similar results would actually be a bigger deal. For now I’ll assume it’s the former.
- Nina Panickssery 27 Apr 2024 23:05 UTC
  LW: 21 AF: 11
  12
  AF
  FWIW I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful) which from what I can tell is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF LLMs.
  
  I think this post is novel compared to both my work and RepE because they:
  - Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering
  - Investigate projection thoroughly as an alternative to sweeping over vector magnitudes (rather than just stating that this is possible)
  - Find that using harmful/harmless instructions (rather than harmful vs. harmless/refusal responses) to generate a contrast vector is the most effective (whereas other works try one or the other), and also investigate the token position at which to extract the representation
  - Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer, which is different from standard activation steering
  - Test on many different models
  - Describe a way of turning this into a weight-edit
  Edit:
  
  (Want to flag that I strong-disagree-voted with your comment, and am not in the research group—it is not them “dogpiling”)
  
  I do agree that RepE should be included in a “related work” section of a paper but generally people should be free to post research updates on LW/AF that don’t have a complete thorough lit review / related work section. There are really very many activation-steering-esque papers/blogposts now, including refusal-bypassing-related ones, that all came out around the same time.
  - Dan H 28 Apr 2024 0:40 UTC
    LW: 1 AF: 1
    −5
    AF
    
    “but generally people should be free to post research updates on LW/AF that don’t have a complete thorough lit review / related work section.”
    
    I agree, provided they also agree that they don’t expect the post to be cited. Posts can’t posture as academic artifacts (“Citing this work” indicates that’s the expectation) and fail to mention related work. I don’t think you should expect people to treat it as related work if you don’t cover related work yourself.
    
    Otherwise there’s a race to the bottom, and it makes sense to post daily research notes and plant flags that way. This further increases pressure on researchers.
    
    “including refusal-bypassing-related ones”
    
    The prior work that is covered in the document is generally less related (fine-tuning removal of safeguards, truth directions) compared to these directly relevant ones. This is an unusual citation pattern and gives the impression that the artifact is making more progress/advancing understanding than it actually is.
    
    I’ll note that pretty much every time I mention that something isn’t following academic standards on LW, I get ganged up on, and I find it pretty weird. I’ve reviewed for and organized ML conferences, can be senior area chair at them, and know the standards well. Perhaps this response is so consistent because it feels like an outside community imposing things on LW.
  - Dan H 28 Apr 2024 1:56 UTC
    LW: -2 AF: 1
    −22
    AF
    
    “is novel compared to… RepE”
    
    This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405
    
    “Demonstrate full ablation of the refusal behavior with much less effect on coherence”
    
    In our paper and notebook we show the models are coherent.
    
    “Investigate projection”
    
    We did investigate projection too (we use it for concept removal in the RepE paper) but didn’t find a substantial benefit for jailbreaking.
    
    “harmful/harmless instructions”
    
    We use harmful/harmless instructions.
    
    “Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer”
    
    In the RepE paper we target multiple layers as well.
    
    “Test on many different models”
    
    The paper used Vicuna, the notebook used Llama 2. Throughout the paper we showed the general approach worked on many different models.
    
    “Describe a way of turning this into a weight-edit”
    
    We do weight editing in the RepE paper (that’s why it’s called RepE instead of ActE).
    - Nina Panickssery 28 Apr 2024 11:21 UTC
      LW: 17 AF: 10
      20
      AF
      “We do weight editing in the RepE paper (that’s why it’s called RepE instead of ActE)”
      
      I looked at the paper again and couldn’t find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction).
      
      The LoRRA approach mentioned in RepE finetunes the model to change representations which is different.
    - Nina Panickssery 28 Apr 2024 7:25 UTC
      LW: 17 AF: 12
      9
      AF
      I agree you investigate a bunch of the stuff I mentioned generally somewhere in the paper, but did you do this for refusal-removal in particular? I spent some time on this problem before and noticed that full refusal ablation is hard unless you get the technique/vector right, even though it’s easy to reduce refusal or add in a bunch of extra refusal. That’s why investigating all the technique parameters in the context of refusal in particular is valuable.
- Andy Arditi 27 Apr 2024 22:51 UTC
  11 points
  0
  I will reach out to Andy Zou to discuss this further via a call, and hopefully clear up what seems like a misunderstanding to me.
  One point of clarification here though—when I say “we examined Section 6.2 carefully before writing our work,” I meant that we reviewed it carefully to understand it and to check that our findings were distinct from those in Section 6.2. We did indeed conclude this to be the case before writing and sharing this work.