Thank you for the interesting work! I’d like to ask a question regarding this detail:
> Note that the average projection measurement and the intervention are performed only at layer ℓ, the layer from which the best “refusal direction” r̂ was extracted.
Why do you add the refusal direction at only one layer when inducing refusal, but remove it at every layer when ablating? Is there a reason or intuition behind this? What if, in later layers, the activations are steered away from that direction, making the method less effective?
We ablate the direction everywhere for simplicity—intuitively this prevents the model from ever representing the direction in its computation, and so a behavioral change that results from the ablation can be attributed to mediation through this direction.
However, we noticed empirically that it is not necessary to ablate the direction at all layers in order to bypass refusal. Ablating at a narrow local region (2-3 middle layers) can be just as effective as ablating across all layers, suggesting that the direction is “read” or “processed” at some local region.
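To be concrete about the operation: by “ablating” we mean projecting the direction out of the residual stream. A minimal sketch of that operation (tensor names are illustrative, not the repo’s):

```python
import torch

def ablate_direction(x: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of residual-stream activations.

    x:     activations, shape (..., d_model)
    r_hat: refusal direction, shape (d_model,), assumed unit-norm
    """
    # Subtract the component of x along r_hat: x - (x . r_hat) r_hat
    return x - (x @ r_hat).unsqueeze(-1) * r_hat
```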
> suggesting that the direction is “read” or “processed” at some local region.
Interesting point here. I would further add that these local regions might be token-dependent. I’ve found that, at different positions (though I only experimented on the tokens that come after the instruction), the refusal direction can be extracted from different layers. Each of these refusal directions seems to work well when used to ablate the layers surrounding the layer where it was extracted.
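A sketch of the per-position extraction I mean (the tensor names and shapes here are my own setup, not the notebook’s):

```python
import torch

def candidate_directions(harmful_acts: torch.Tensor,
                         harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-in-means directions for every (layer, position) pair.

    Both inputs are cached residual-stream activations of shape
    (n_samples, n_layers, n_positions, d_model), where positions index
    the tokens that come after the instruction.
    """
    diff = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return diff / diff.norm(dim=-1, keepdim=True)  # (n_layers, n_positions, d_model)
```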
Oh, and by the way, I found a minor redundancy in the code. The intervention is applied to all three streams: `pre`, `mid`, and `post`. But since the `post` stream of one layer is also the `pre` stream of the next layer, we only need to process one of the two. Having both won’t produce incorrect results, but it could slow down the experiments. There is one edge case at either the last layer’s `post` or the first layer’s `pre`, but that can be fixed easily or ignored entirely.
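Something like this would touch every distinct stream exactly once (a sketch assuming the TransformerLens hook names the notebook uses; `direction_ablation_hook` stands in for the notebook’s hook function, and `model` and `tokens` for its HookedTransformer and tokenized prompts):

```python
# resid_post at layer l is the same activation as resid_pre at layer l+1,
# so hooking both applies the intervention twice to the same values.
n_layers = model.cfg.n_layers
hook_names = [f"blocks.{l}.hook_resid_pre" for l in range(n_layers)]   # each layer's input
hook_names += [f"blocks.{l}.hook_resid_mid" for l in range(n_layers)]  # between attn and MLP
hook_names.append(f"blocks.{n_layers - 1}.hook_resid_post")            # the one uncovered edge case

fwd_hooks = [(name, direction_ablation_hook) for name in hook_names]
logits = model.run_with_hooks(tokens, fwd_hooks=fwd_hooks)
```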
Many thanks for the insight. I have been experimenting with the notebook and can confirm that ablating at some middle layers is effective at removing the refusal behaviour. I also observed that the effect gets more significant as I increase the number of ablated layers. However, in my experiments, 2-3 layers were insufficient to get a great result: I only saw a minimal effect with 1-3 layers, and only with 7 or more layers was the effect comparable to ablating everywhere. (Disclaimer: I’m experimenting with Qwen1 and Qwen2.5 models; this might not hold for other model families.)
I select the layers to be ablated as a range of consecutive layers centred on the layer where the refusal direction was extracted. Perhaps this choice of layers is why I got different results from yours? Could you provide some insight on how you select the layers to ablate?
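Concretely, I pick the window like this (variable names are just for illustration):

```python
# Consecutive layers centred on the layer the direction was extracted from.
k = 3  # half-width of the window; I vary this between experiments
layers = list(range(max(0, extraction_layer - k),
                    min(n_layers, extraction_layer + k + 1)))
```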
Another question: I haven’t been able to successfully induce refusal. I tried adding the direction at the layer where it was extracted, at a local region around that layer, and everywhere, but none of these gives good results. Could there be additional steps that I’m missing?
One experiment I ran to check the locality:

For ℓ = 0, 1, …, L:
- Ablate the refusal direction at layers ℓ, ℓ+1, …, L.
- Measure the refusal score across harmful prompts.

Below is the result for Qwen 1.8B:

[plot: refusal score as a function of the starting layer ℓ]

You can see that the ablations before layer ~14 don’t have much of an impact, nor do the ablations after layer ~17. Running another experiment, ablating the refusal direction just at layers 14-17, shows that this is roughly as effective as ablating the refusal direction at all layers.
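In code, the sweep amounts to something like this (`run_with_ablation` and `refusal_score` are stand-ins for our generation and scoring utilities):

```python
# For each starting layer l, ablate r_hat at layers l, l+1, ..., L and
# measure how often the model still refuses the harmful prompts.
scores = []
for l in range(model.cfg.n_layers):
    layers = list(range(l, model.cfg.n_layers))
    completions = run_with_ablation(model, harmful_prompts, r_hat, layers)
    scores.append(refusal_score(completions))
```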
As for inducing refusal, we did a pretty extreme intervention in the paper—we added the difference-in-means vector to every token position, including generated tokens (although only at a single layer). Hard to say what the issue is without seeing your code—I recommend comparing your intervention to the one we define in the paper (it’s implemented in our repo as well).
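Roughly, the intervention looks like this (a sketch in TransformerLens terms; see the repo for the exact implementation):

```python
import functools

def act_add_hook(activation, hook, vec):
    # Add the difference-in-means vector at every token position. Because
    # the hook stays registered during sampling, it also applies at the
    # positions of newly generated tokens.
    return activation + vec

# layer is the layer the direction was extracted from; r is the
# (un-normalized) difference-in-means vector at that layer.
model.add_hook(f"blocks.{layer}.hook_resid_pre", functools.partial(act_add_hook, vec=r))
completions = model.generate(harmless_tokens, max_new_tokens=64)
model.reset_hooks()
```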
Thanks for the insight on the locality check experiment.
For inducing refusal, I used the code from the demo notebook provided in your post. It doesn’t have a section on inducing refusal, but I simply inverted the difference-in-means vector and set the intervention layer to the single layer where that vector was extracted. I believe this has the same effect as what you described, which is to apply the intervention to every token at a single layer. I will check out your repo to see if I missed something. Thank you for the discussion.