Was it substantially less effective to instead use $a'_{\text{harmless}} \leftarrow a_{\text{harmless}} + (\text{avg\_proj}_{\text{harmful}})\,\hat{r}$?
It’s about the same. And there’s a nice reason why: $a_{\text{harmless}} \cdot \hat{r} \approx 0$. I.e. for most harmless prompts, the projection onto the refusal direction is approximately zero (while it’s very positive for harmful prompts). We don’t display this clearly in the post, but you can roughly see it if you look at the PCA figure (PC 1 roughly corresponds to the “refusal direction”). This is (one reason) why we think ablation of the refusal direction works so much better than adding the negative “refusal direction,” and it’s also what motivated us to try ablation in the first place!
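For concreteness, here's a minimal NumPy sketch of the two interventions being contrasted; the names (`ablate_direction`, `r_hat`, etc.) are illustrative, not our actual code, and the shapes are just one plausible convention.

```python
import numpy as np

def ablate_direction(acts: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Zero out each activation's component along the refusal direction.
    acts: (n_tokens, d_model); r_hat: unit vector of shape (d_model,)."""
    proj = acts @ r_hat                  # per-token projection onto r_hat
    return acts - np.outer(proj, r_hat)  # remove exactly that component

def add_negative_direction(acts: np.ndarray, r_hat: np.ndarray, coeff: float) -> np.ndarray:
    """Subtract a fixed multiple of r_hat from every activation."""
    return acts - coeff * r_hat

# For harmless prompts, acts @ r_hat is already ~0, so ablation barely moves them;
# subtracting a fixed coeff * r_hat instead pushes them to a negative projection,
# which is one intuition for why ablation degrades harmless behavior less.
```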
I do want to note that your boost in refusals seems huge, well beyond 8%. I'm somewhat surprised by how large it is.
Note that our intervention is fairly strong here, as we are intervening at all token positions (including the newly generated tokens). But in general we’ve found it quite easy to induce refusal, and I believe we could even weaken our intervention to a subset of token positions and achieve similar results. We’ve previously reported how easily refusal can be induced (patching just 6 attention heads at a single token position in Llama-2-7B-chat).
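As a rough illustration (not our exact implementation), adding the refusal direction at every token position can be done with a PyTorch forward hook like the sketch below; `model`, `layer_idx`, `refusal_dir`, and the `coeff` value are assumptions for the example, not values from the post.

```python
import torch

def make_add_direction_hook(refusal_dir: torch.Tensor, coeff: float):
    """Return a forward hook that adds coeff * refusal_dir to the residual
    stream at every token position (including newly generated tokens)."""
    def hook(module, inputs, output):
        # Decoder layers often return a tuple whose first element is the
        # (batch, seq_len, d_model) hidden states; handle both cases.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * refusal_dir
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage on a Llama-style model:
# handle = model.model.layers[layer_idx].register_forward_hook(
#     make_add_direction_hook(refusal_dir, coeff=8.0))
# ... run generation with the hook active ...
# handle.remove()
```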
Burns et al. do activation engineering? I thought the CCS paper didn’t involve that.
You’re right, thanks for the catch! I’ll update the text so it’s clear that the CCS paper does not perform model interventions.