Many thanks for the insight.
I have been experimenting with the notebook and can confirm that ablating at some middle layers is effective at removing the refusal behaviour. I also observed that the effect becomes stronger as I increase the number of ablated layers. However, in my experiments, 2-3 layers were not enough to get a good result: I saw only a minimal effect with 1-3 layers, and only with 7 or more layers was the effect comparable to ablating everywhere. (Disclaimer: I’m experimenting with Qwen1 and Qwen2.5 models, so this might not hold for other model families.)
I select the layers to ablate as a range of consecutive layers centred on the layer where the refusal direction was extracted. Perhaps this choice of layers is why my results differ from yours? Could you provide some insight into how you select the layers to ablate?
Another question: I haven’t been able to induce refusal successfully. I tried adding the direction at the layer where it was extracted, in a local region around that layer, and everywhere, but none of these gives good results. Could there be additional steps that I’m missing?
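For concreteness, the layer selection described above might look like the following minimal sketch. The function name `centred_window`, the clipping behaviour at the ends of the network, and the example numbers (24 layers, centre 15) are assumptions made for illustration; they are not taken from the notebook.

```python
def centred_window(centre: int, width: int, num_layers: int) -> list[int]:
    """Return `width` consecutive layer indices centred on `centre`.

    The window is shifted (not truncated) when it would run past either
    end of the network, so it always contains `width` layers whenever
    `width <= num_layers`.
    """
    if not 0 <= centre < num_layers:
        raise ValueError("centre must be a valid layer index")
    if width < 1:
        raise ValueError("width must be at least 1")
    width = min(width, num_layers)
    # For even widths the extra layer falls after the centre.
    start = centre - (width - 1) // 2
    start = max(0, min(start, num_layers - width))
    return list(range(start, start + width))


# Hypothetical example: 24-layer model, direction extracted at layer 15.
narrow = centred_window(centre=15, width=3, num_layers=24)
wide = centred_window(centre=15, width=7, num_layers=24)
```

With a selection rule like this, whether a narrow window works depends on how well the extraction layer lines up with the layers that actually matter, which is one possible reason for needing a wider window.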
One experiment I ran to check the locality:
For ℓ = 0, 1, …, L:
    Ablate the refusal direction at layers ℓ, ℓ+1, …, L
    Measure the refusal score across harmful prompts
Below is the result for Qwen 1.8B:
You can see that the ablations before layer ~14 don’t have much of an impact, nor do the ablations after layer ~17. Running another experiment just ablating the refusal direction at layers 14-17 shows that this is roughly as effective as ablating the refusal direction from all layers.
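The sweep described above can be sketched as a short loop. This is only an outline of the procedure: `suffix_sweep` and `toy_refusal_score` are hypothetical names, and the toy scorer is a made-up stand-in (it pretends refusal depends only on a band of layers 14-17) so that the block runs on its own. In a real run, the scorer would ablate the direction at the given layers and measure refusal on harmful prompts.

```python
from typing import Callable, Sequence


def suffix_sweep(
    num_layers: int,
    refusal_score: Callable[[Sequence[int]], float],
) -> list[float]:
    """For each start layer, ablate at layers start..num_layers-1 and score."""
    scores = []
    for start in range(num_layers):
        ablated_layers = list(range(start, num_layers))
        scores.append(refusal_score(ablated_layers))
    return scores


def toy_refusal_score(ablated_layers: Sequence[int]) -> float:
    """Toy stand-in, NOT real data: fraction of a hypothetical band left intact."""
    band = {14, 15, 16, 17}  # assumed band, chosen only to mirror the description
    remaining = band - set(ablated_layers)
    return len(remaining) / len(band)


scores = suffix_sweep(num_layers=24, refusal_score=toy_refusal_score)
# Layers where the score changes when the start layer moves by one
# mark the region in which ablation actually matters.
changes = [i for i in range(1, len(scores)) if scores[i] != scores[i - 1]]
```

Reading the curve this way, a flat region at the start and a flat region at the end bracket the layers where the ablation has an effect.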
As for inducing refusal, we did a pretty extreme intervention in the paper—we added the difference-in-means vector to every token position, including generated tokens (although only at a single layer). Hard to say what the issue is without seeing your code—I recommend comparing your intervention to the one we define in the paper (it’s implemented in our repo as well).
Thanks for the insight on the locality check experiment.
For inducing refusal, I used the code from the demo notebook provided in your post. It doesn’t have a section on inducing refusal, so I just inverted the difference-in-means vector and set the intervention layer to the single layer where that vector was extracted. I believe this has the same effect as what you described, namely applying the intervention to every token at a single layer. I will check out your repo to see if I missed something. Thank you for the discussion.
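One thing that may be worth checking: if the notebook's intervention is a projection-based ablation (an assumption here, not verified against the notebook), then inverting the vector changes nothing, because projecting out a direction gives the same result for r and for -r. Inducing refusal as described above needs an additive intervention instead. The NumPy sketch below illustrates the difference on random activations; the function names and array shapes are illustrative, not the repo's API.

```python
import numpy as np


def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component along `direction` from each row of `acts`."""
    r_hat = direction / np.linalg.norm(direction)
    # (acts @ r_hat) is the per-position projection coefficient.
    return acts - np.outer(acts @ r_hat, r_hat)


def add_direction(acts: np.ndarray, direction: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Add `scale * direction` to every position (row) of `acts`."""
    return acts + scale * direction


rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 8))  # hypothetical (positions, d_model) activations
r = rng.normal(size=8)          # stand-in for a difference-in-means vector

# Ablation is unchanged by flipping the sign of the direction...
ablation_sign_invariant = np.allclose(ablate_direction(acts, r), ablate_direction(acts, -r))
# ...whereas addition moves the activations in opposite directions.
addition_sign_sensitive = not np.allclose(add_direction(acts, r), add_direction(acts, -r))
```

If that is what happened, replacing the ablation hook with an additive one, applied at every token position of the chosen layer (including generated tokens), would be the change to try.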