suggesting that the direction is “read” or “processed” at some local region.
Interesting point here. I would further add that these local regions might be token-dependent. I’ve found that, at different positions (though I only experimented on the tokens that come after the instruction), the refusal direction can be extracted from different layers. Each of these refusal directions seems to work well when used to ablate a few layers surrounding the layer where it was extracted.
Oh and btw, I found a minor redundancy in the code. The intervention is added to all 3 streams: `pre`, `mid`, and `post`. But since the `post` stream from one layer is also the `pre` stream of the next layer, we can just process either of the 2. Having both won’t produce any issues but could slow down the experiments. There is one edge case on either the last layer’s `post` or the first layer’s `pre`, but that can be fixed easily or ignored entirely anyway.
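To make the redundancy concrete, here is a minimal sketch of one way to drop the duplicate hook points. It assumes TransformerLens-style hook names (`blocks.{layer}.hook_resid_pre` and so on); the helper `residual_hook_points` is hypothetical and not part of the original code. It keeps `pre` and `mid` for every layer and keeps `post` only for the last layer, which covers the edge case mentioned above.

```python
def residual_hook_points(n_layers: int, dedupe: bool = True) -> list[str]:
    """List residual-stream hook names to intervene on.

    Hook names follow the TransformerLens naming pattern (an assumption here).
    With dedupe=True, a layer's `post` hook is skipped whenever the next
    layer's `pre` hook already covers the same activations.
    """
    names = []
    for layer in range(n_layers):
        names.append(f"blocks.{layer}.hook_resid_pre")
        names.append(f"blocks.{layer}.hook_resid_mid")
        is_last_layer = layer == n_layers - 1
        # `post` of layer l is the same stream as `pre` of layer l + 1,
        # so it only needs its own hook on the final layer.
        if not dedupe or is_last_layer:
            names.append(f"blocks.{layer}.hook_resid_post")
    return names


if __name__ == "__main__":
    print(len(residual_hook_points(4, dedupe=False)))  # all three streams per layer
    print(len(residual_hook_points(4)))                # duplicates removed
```

For a model with n layers this goes from 3n hooked positions to 2n + 1.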
Thanks for the insight on the locality check experiment.
For inducing refusal, I used the code from the demo notebook provided in your post. It doesn’t have a section on inducing refusal, but I just inverted the difference-in-means vector and set the intervention layer to the single layer where that vector was extracted. I believe this has the same effect as what you described, which is to apply the intervention to every token at a single layer. I will check out your repo to see if I missed something. Thank you for the discussion.
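For reference, a toy sketch of what "apply the intervention to every token at a single layer" could look like, under the assumption that the intervention is additive (a scaled direction added to the residual stream). It uses plain Python lists instead of tensors, and `make_addition_hook` is a hypothetical helper, not a function from the notebook or the repo. The sign of `coeff` depends on how the difference of means was taken, so it is left as a parameter.

```python
def make_addition_hook(direction, target_layer, coeff=1.0):
    """Build a hook that adds coeff * direction at one layer only.

    resid is a list of per-token residual vectors (lists of floats).
    The hook leaves every layer other than target_layer unchanged.
    """
    def hook(resid, layer):
        if layer != target_layer:
            return resid
        # Same shift at every token position of the target layer.
        return [
            [x + coeff * d for x, d in zip(token_vec, direction)]
            for token_vec in resid
        ]
    return hook


if __name__ == "__main__":
    hook = make_addition_hook(direction=[1.0, 0.0], target_layer=2, coeff=2.0)
    resid = [[0.0, 1.0], [2.0, 3.0]]  # two tokens, two dimensions
    print(hook(resid, layer=1))  # other layer: unchanged
    print(hook(resid, layer=2))  # target layer: shifted along the direction
```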