lone17 comments on Refusal in LLMs is mediated by a single direction

lone17 29 Oct 2024 11:47 UTC
1 point
suggesting that the direction is “read” or “processed” at some local region.
Interesting point here. I would further add that these local regions might be token-dependent. I’ve found that at different token positions (though I only experimented with the tokens that come after the instruction), the refusal direction can be extracted from different layers. Each of these refusal directions seems to work well when used to ablate the layers surrounding the one it was extracted from.
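To make the setup concrete, here is a minimal sketch of what I mean, in plain Python on toy vectors. It assumes the direction is taken as a unit-norm difference of mean activations between harmful and harmless prompts at one (layer, position) pair, and that ablation means projecting that direction out of an activation. The helper names and the `radius` parameter are hypothetical, just for illustration, and are not taken from the original code.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of means for ONE (layer, position) pair.

    Assumption: the direction is extracted as mean(harmful) - mean(harmless).
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ablate(x, direction):
    """Remove the component of x along a unit-norm direction."""
    proj = sum(a * b for a, b in zip(x, direction))
    return [a - proj * b for a, b in zip(x, direction)]

def layers_to_ablate(source_layer, radius, n_layers):
    """Hypothetical helper: the window of layers around the extraction layer."""
    lo = max(0, source_layer - radius)
    hi = min(n_layers, source_layer + radius + 1)
    return list(range(lo, hi))

# Toy activations (2-dimensional) at a single layer and token position.
harmful = [[2.0, 0.0], [4.0, 0.0]]
harmless = [[0.0, 0.0], [0.0, 0.0]]
direction = refusal_direction(harmful, harmless)
ablated = ablate([5.0, 7.0], direction)
window = layers_to_ablate(source_layer=10, radius=2, n_layers=32)
```

In a real run you would repeat the extraction for each token position, note which layer gives the best direction for that position, and then ablate only over the window around that layer.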
Oh and btw, I found a minor redundancy in the code. The intervention is added to all 3 streams: pre, mid, and post. But since the post stream of one layer is also the pre stream of the next layer, we only need to process one of the two. Having both won’t cause any issues but could slow down the experiments. There is one edge case at either the last layer’s post or the first layer’s pre, but that can be fixed easily or ignored entirely anyway.
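Here is a toy check of the redundancy, as a sketch only. It assumes a simplified residual stream where each layer adds an attention output and then an MLP output, with ablation hooks at pre, mid, and post. The layer functions are made up and this is not the actual code from the repo. Skipping every pre hook except the first layer's (equivalently, keeping the last layer's post) should give the same final stream as hooking all three everywhere.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def ablate(x, direction):
    """Remove the component of x along a unit-norm direction."""
    proj = dot(x, direction)
    return [a - proj * b for a, b in zip(x, direction)]

def make_layer(scale):
    """Made-up attention and MLP functions that write into the stream."""
    def attn(x):
        return [scale * x[1] + 1.0, scale * x[0]]
    def mlp(x):
        return [scale * x[0] - 0.5, scale * x[1] + 2.0]
    return attn, mlp

def run(x, layers, direction, hooks):
    """Forward pass; `hooks` is a set of (point, layer_index) pairs to ablate at."""
    for i, (attn, mlp) in enumerate(layers):
        if ("pre", i) in hooks:
            x = ablate(x, direction)
        x = [a + b for a, b in zip(x, attn(x))]
        if ("mid", i) in hooks:
            x = ablate(x, direction)
        x = [a + b for a, b in zip(x, mlp(x))]
        if ("post", i) in hooks:
            x = ablate(x, direction)
    return x

layers = [make_layer(0.1 * (i + 1)) for i in range(3)]
n = len(layers)
direction = [0.6, 0.8]  # approximately unit norm
x0 = [1.0, 2.0]

all_three = {(p, i) for p in ("pre", "mid", "post") for i in range(n)}
# post of layer i is the same tensor as pre of layer i + 1, so drop the
# duplicated pre hooks and keep only the first layer's pre.
deduped = {("pre", 0)} | {(p, i) for p in ("mid", "post") for i in range(n)}
# Edge case: dropping post everywhere leaves the last layer's MLP write untouched.
no_last_post = {(p, i) for p in ("pre", "mid") for i in range(n)}

out_all = run(x0, layers, direction, all_three)
out_deduped = run(x0, layers, direction, deduped)
out_no_last_post = run(x0, layers, direction, no_last_post)
```

`out_all` and `out_deduped` agree up to floating-point error, while `out_no_last_post` still has a component along the direction, which is the edge case on the last layer’s post.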