lone17 comments on Refusal in LLMs is mediated by a single direction

lone17 31 Oct 2024 8:53 UTC
1 point
0
Thanks for the insight on the locality check experiment.
For inducing refusal, I used the code from the demo notebook provided in your post. It doesn’t have a section on inducing refusal but I just invert the difference-in-means vector and set the intervention layer to the single layer where said vector was extracted. I believe this has the same effect as what you described, which is to apply the intervention to every token at a single layer. Will checkout your repo to see if I missed something. Thank you for the discussion.