Andy Arditi comments on Refusal in LLMs is mediated by a single direction

Andy Arditi 30 Oct 2024 19:46 UTC
7 points
0
One experiment I ran to check the locality:
- For $\ell = 0, 1, \dots, L$:
  - Ablate the refusal direction at layers $\ell, \ell + 1, \dots, L$
  - Measure refusal score across harmful prompts
Below is the result for Qwen 1.8B:
You can see that ablations before layer ~14 don’t have much of an impact, and neither do ablations after layer ~17. A follow-up experiment that ablates the refusal direction only at layers 14-17 shows that this is roughly as effective as ablating it at all layers.
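The sweep described above can be sketched roughly as follows. This is a toy NumPy illustration, not the code used for the Qwen experiment: the "model" is a list of hypothetical functions that each write into a residual stream, the ablation is applied to the residual stream after each layer, and `score_fn` is a stand-in for the refusal score. All names here (`ablate_direction`, `run_with_ablation`, `locality_sweep`, the toy layers) are assumptions made for illustration.

```python
import numpy as np

def ablate_direction(x, r_hat):
    """Remove the component of each row of x along the unit vector r_hat."""
    return x - np.outer(x @ r_hat, r_hat)

def run_with_ablation(x0, layers, r_hat, ablate_layers):
    """Run a toy residual stream, projecting out r_hat after the chosen layers."""
    x = x0
    for i, layer in enumerate(layers):
        x = x + layer(x)  # each layer adds its output to the residual stream
        if i in ablate_layers:
            x = ablate_direction(x, r_hat)
    return x

def locality_sweep(x0, layers, r_hat, score_fn):
    """For each start layer l, ablate at layers l..L and record the score."""
    n = len(layers)
    return [
        score_fn(run_with_ablation(x0, layers, r_hat, set(range(start, n))))
        for start in range(n)
    ]

# Toy setup (hypothetical): layer 2 writes the "refusal" direction,
# layer 3 reads it and writes a separate "refuse" output direction.
d_model = 4
r_hat = np.array([1.0, 0.0, 0.0, 0.0])
out_dir = np.array([0.0, 1.0, 0.0, 0.0])

def no_op(x):
    return np.zeros_like(x)

def writer(x):
    return np.tile(r_hat, (x.shape[0], 1))

def reader(x):
    return np.outer(x @ r_hat, out_dir)

layers = [no_op, no_op, writer, reader, no_op]
x0 = np.zeros((3, d_model))  # 3 toy "prompts"

scores = locality_sweep(x0, layers, r_hat, lambda x: float((x @ out_dir).mean()))
print(scores)
```

In this toy, starting the ablation at layer 2 or earlier drives the score to zero, while starting at layer 3 or later leaves it unchanged, which is the kind of step pattern the sweep is designed to reveal.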
As for inducing refusal, we did a pretty extreme intervention in the paper—we added the difference-in-means vector to every token position, including generated tokens (although only at a single layer). Hard to say what the issue is without seeing your code—I recommend comparing your intervention to the one we define in the paper (it’s implemented in our repo as well).
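As a rough sketch of the intervention described above (adding the difference-in-means vector to every token position at a single layer), something like the following captures the idea. It is a NumPy toy under stated assumptions, not the repo's implementation: activations are plain arrays, and `difference_in_means`, `add_direction`, `maybe_intervene`, and the `coeff` parameter are hypothetical names.

```python
import numpy as np

def difference_in_means(harmful_acts, harmless_acts):
    """Mean harmful activation minus mean harmless activation, shape (d_model,)."""
    return harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)

def add_direction(resid, r, coeff=1.0):
    """Add coeff * r to every token position of a (seq_len, d_model) array."""
    return resid + coeff * r  # broadcasts over the sequence axis

def maybe_intervene(resid, layer_idx, target_layer, r):
    """Hypothetical per-layer hook: intervene only at target_layer."""
    if layer_idx == target_layer:
        return add_direction(resid, r)
    return resid

# Toy data (made up for illustration): harmful activations are shifted along axis 0.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(8, 4)) + np.array([2.0, 0.0, 0.0, 0.0])
harmless = rng.normal(size=(8, 4))
r = difference_in_means(harmful, harmless)

resid = np.zeros((5, 4))  # 5 token positions
steered = maybe_intervene(resid, layer_idx=3, target_layer=3, r=r)
untouched = maybe_intervene(resid, layer_idx=2, target_layer=3, r=r)
```

To cover generated tokens as well, a hook like this would need to run on every forward pass during generation, so that each newly generated position also receives the added vector.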
- lone17 31 Oct 2024 8:53 UTC
  1 point
  0
  Thanks for the insight on the locality check experiment.
  For inducing refusal, I used the code from the demo notebook provided in your post. It doesn’t have a section on inducing refusal, so I just inverted the difference-in-means vector and set the intervention layer to the single layer where that vector was extracted. I believe this has the same effect as what you described, which is to apply the intervention to every token at a single layer. I will check out your repo to see if I missed something. Thank you for the discussion.