TurnTrout comments on Refusal in LLMs is mediated by a single direction

TurnTrout 2 May 2024 16:19 UTC
LW: 3 AF: 3
When we then run the model on harmless prompts, we intervene such that the expression of the “refusal direction” is set to the average expression on harmful prompts:
$a_{harmless}' \leftarrow a_{harmless} - (a_{harmless} \cdot \hat{r})\hat{r} + (\text{avg\_proj}_{harmful})\hat{r}$
Note that the average projection measurement and the intervention are performed only at layer $l$, the layer at which the best “refusal direction” $\hat{r}$ was extracted from.
Was it substantially less effective to instead use
$a_{harmless}' \leftarrow a_{harmless} + (\text{avg\_proj}_{harmful})\hat{r}$?
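To make the comparison concrete: the two updates differ by exactly $(a_{harmless} \cdot \hat{r})\hat{r}$, so they give nearly the same result whenever harmless activations have a small projection onto $\hat{r}$. Below is a minimal NumPy sketch of both variants. It is not the post's code; the function names, the synthetic activations, the dimension, and the value `8.0` for the average harmful projection are all assumptions made up for illustration.

```python
import numpy as np

def set_projection(acts, r_hat, target):
    """Overwrite the component of each activation along r_hat with `target`."""
    proj = acts @ r_hat  # one scalar projection per token position
    return acts - np.outer(proj, r_hat) + target * r_hat

def add_direction(acts, r_hat, target):
    """Add target * r_hat without first removing the existing component."""
    return acts + target * r_hat

rng = np.random.default_rng(0)
d_model = 64  # hypothetical residual-stream width

# Hypothetical unit-norm "refusal direction".
r_hat = rng.normal(size=d_model)
r_hat /= np.linalg.norm(r_hat)

# Synthetic stand-in for harmless activations at 5 token positions,
# with the component along r_hat shrunk so projections are small.
acts = rng.normal(size=(5, d_model))
acts -= 0.95 * np.outer(acts @ r_hat, r_hat)

avg_proj_harmful = 8.0  # made-up value, not a measurement

a_set = set_projection(acts, r_hat, avg_proj_harmful)
a_add = add_direction(acts, r_hat, avg_proj_harmful)

# Per-position distance between the two variants; equals |a . r_hat|.
gap = np.linalg.norm(a_add - a_set, axis=1)
print(gap)
```

In this sketch the gap at each position equals the magnitude of the original projection, so the question comes down to how large that projection typically is on harmless prompts in the real model.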
We find this result unsurprising and implied by prior work, but include it for completeness. For example, Zou et al. 2023 showed that adding a harmfulness direction led to an 8 percentage point increase in refusal on harmless prompts in Vicuna 13B.
I do want to note that your boost in refusals seems absolutely huge, well beyond 8%? I am somewhat surprised by how huge your boost is.
using this direction to intervene on model activations to steer the model towards or away from the concept (Burns et al. 2022
Burns et al. do activation engineering? I thought the CCS paper didn’t involve that.
- Andy Arditi 2 May 2024 23:05 UTC
  LW: 1 AF: 1
  Was it substantially less effective to instead use $a_{harmless}' \leftarrow a_{harmless} + (\text{avg\_proj}_{harmful})\hat{r}$?
  
  It’s about the same. And there’s a nice reason why: $a_{harmless} \cdot \hat{r} \approx 0$. I.e. for most harmless prompts, the projection onto the refusal direction is approximately zero (while it’s very positive for harmful prompts). We don’t display this clearly in the post, but you can roughly see it if you look at the PCA figure (PC 1 roughly corresponds to the “refusal direction”). This is (one reason) why we think ablation of the refusal direction works so much better than adding the negative “refusal direction,” and it’s also what motivated us to try ablation in the first place!
  I do want to note that your boost in refusals seems absolutely huge, well beyond 8%? I am somewhat surprised by how huge your boost is.
  Note that our intervention is fairly strong here, as we are intervening at all token positions (including the newly generated tokens). But in general we’ve found it quite easy to induce refusal, and I believe we could even restrict our intervention to a subset of token positions and achieve similar results. We’ve previously reported the ease with which we can induce refusal (patching just 6 attention heads at a single token position in Llama-2-7B-chat).
  Burns et al. do activation engineering? I thought the CCS paper didn’t involve that.
  You’re right, thanks for the catch! I’ll update the text so it’s clear that the CCS paper does not perform model interventions.