When we then run the model on harmless prompts, we intervene such that the expression of the “refusal direction” is set to the average expression on harmful prompts:
Note that the average projection measurement and the intervention are performed only at layer $l$, the layer from which the best “refusal direction” $\hat{r}$ was extracted.
Was it substantially less effective to instead use $a'_{\text{harmless}} \leftarrow a_{\text{harmless}} + (\text{avg\_proj}_{\text{harmful}})\,\hat{r}$?
We find this result unsurprising and implied by prior work, but include it for completeness. For example, Zou et al. 2023 showed that adding a harmfulness direction led to an 8 percentage point increase in refusal on harmless prompts in Vicuna 13B.
I do want to note that your boost in refusals seems absolutely huge, well beyond 8%? I am somewhat surprised by how huge your boost is.
using this direction to intervene on model activations to steer the model towards or away from the concept (Burns et al. 2022
Burns et al. do activation engineering? I thought the CCS paper didn’t involve that.
Was it substantially less effective to instead use $a'_{\text{harmless}} \leftarrow a_{\text{harmless}} + (\text{avg\_proj}_{\text{harmful}})\,\hat{r}$?
It’s about the same. And there’s a nice reason why: $a_{\text{harmless}} \cdot \hat{r} \approx 0$. That is, for most harmless prompts, the projection onto the refusal direction is approximately zero (while it’s very positive for harmful prompts). We don’t display this clearly in the post, but you can roughly see it if you look at the PCA figure (PC 1 roughly corresponds to the “refusal direction”). This is one reason why we think ablating the refusal direction works so much better than adding the negative “refusal direction,” and it’s also what motivated us to try ablation in the first place!
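The argument above can be checked numerically. The sketch below is a toy illustration, not the code used for the post: it assumes the "set" intervention has the form $a - (a \cdot \hat{r})\hat{r} + c\,\hat{r}$ with $c = \text{avg\_proj}_{\text{harmful}}$, and it uses random vectors and a made-up value for $c$ in place of real activations. The function names `set_projection` and `add_direction` are hypothetical. The two interventions differ by exactly $(a \cdot \hat{r})\hat{r}$, so they nearly coincide whenever the harmless projection is near zero.

```python
import numpy as np

def set_projection(a, r_hat, target):
    """Remove the component of `a` along `r_hat`, then add `target * r_hat`."""
    return a - (a @ r_hat) * r_hat + target * r_hat

def add_direction(a, r_hat, target):
    """Add `target * r_hat` without removing the existing component."""
    return a + target * r_hat

rng = np.random.default_rng(0)
d_model = 64  # toy dimension, not a real model width

# Unit-norm stand-in for the refusal direction.
r_hat = rng.normal(size=d_model)
r_hat /= np.linalg.norm(r_hat)

# Toy "harmless" activation whose projection onto r_hat is small (0.01).
a_harmless = rng.normal(size=d_model)
a_harmless = a_harmless - (a_harmless @ r_hat) * r_hat + 0.01 * r_hat

avg_proj_harmful = 5.0  # hypothetical value, chosen only for illustration

a_set = set_projection(a_harmless, r_hat, avg_proj_harmful)
a_add = add_direction(a_harmless, r_hat, avg_proj_harmful)

# The gap between the two interventions is the original projection along r_hat.
gap = a_add - a_set
print(np.linalg.norm(gap), abs(a_harmless @ r_hat))
```

With a large harmless projection the gap would be correspondingly large, which is why the near-zero projection on harmless prompts matters for the equivalence.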
I do want to note that your boost in refusals seems absolutely huge, well beyond 8%? I am somewhat surprised by how huge your boost is.
Note that our intervention is fairly strong here, as we are intervening at all token positions (including the newly generated tokens). But in general we’ve found it quite easy to induce refusal, and I believe we could even weaken our intervention to a subset of token positions and achieve similar results. We’ve previously reported how easily we can induce refusal (patching just 6 attention heads at a single token position in Llama-2-7B-chat).
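To make the "all token positions" versus "subset of token positions" distinction concrete, here is a minimal sketch on a toy activation matrix of shape `(seq_len, d_model)`. It is an assumption-laden illustration rather than the actual implementation: the helper `set_projection_at` and its `positions` argument are hypothetical, the data is random, and it operates on a fixed array, so it does not model intervening on newly generated tokens during decoding.

```python
import numpy as np

def set_projection_at(acts, r_hat, target, positions=None):
    """Set the projection onto `r_hat` to `target` at the chosen positions.

    acts: array of shape (seq_len, d_model).
    positions: unique token indices to modify; None means every position.
    """
    out = acts.copy()
    if positions is None:
        idx = np.arange(acts.shape[0])
    else:
        idx = np.asarray(positions)
    proj = out[idx] @ r_hat                    # current projection per position
    out[idx] += (target - proj)[:, None] * r_hat
    return out

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16  # toy sizes

r_hat = rng.normal(size=d_model)
r_hat /= np.linalg.norm(r_hat)
acts = rng.normal(size=(seq_len, d_model))

target = 5.0  # hypothetical stand-in for the average harmful projection

all_positions = set_projection_at(acts, r_hat, target)
last_only = set_projection_at(acts, r_hat, target, positions=[-1])
```

Here `all_positions` corresponds to the stronger setting described above, while `last_only` modifies a single position and leaves every other row unchanged.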
Burns et al. do activation engineering? I thought the CCS paper didn’t involve that.
You’re right, thanks for the catch! I’ll update the text so it’s clear that the CCS paper does not perform model interventions.