Andy Arditi comments on Refusal in LLMs is mediated by a single direction

Andy Arditi 1 May 2024 22:57 UTC
[Responding to some select points]
1. I think you’re looking at the harmful_strings dataset, which we do not use. But in general, I agree AdvBench is not the greatest dataset. Multiple follow-up papers (Chao et al. 2024, Souly et al. 2024) point this out. We use it in our train set because it contains a large volume of harmful instructions. But our method might benefit from a cleaner training dataset.
2. We don’t use the targets for anything. We only use the instructions (labeled goal in the harmful_behaviors dataset).
5. I think the choice of padding token shouldn’t matter as long as an attention mask is used. It should work the same if you changed it.
6. Not sure about other empirically studied features that are considered “high-level action features.”
7. This is a great and interesting point! @wesg has also brought this up before! (I wish you had made this into its own comment, so that it could be upvoted and noticed by more people!)
8. We have results showing that you don’t actually need to ablate at all layers—there is a narrow / localized region of layers where the ablation is important. Ablating everywhere is very clean and simple as a methodology though, and that’s why we share it here.
As for adding $\hat{r}$ at multiple layers: this probably heavily depends on the details (e.g. which layers, how many layers, how much you are adding, etc.).
9. We display the second principal component in the post. Notice that it does not separate harmful vs harmless instructions.
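
To make point 8 concrete, here is a minimal sketch of ablating a direction at all layers versus only a subset. It assumes the usual projection form of directional ablation, $x \leftarrow x - \hat{r}\hat{r}^\top x$, and uses a toy residual stack in NumPy rather than a real model. The names `ablate_direction`, `forward`, and `layers_to_ablate`, the random weights, and the choice of layers {2, 3} are all hypothetical and for illustration only; this is not our actual code.

```python
import numpy as np

def ablate_direction(x, r):
    """Remove the component of each row of x along the direction r."""
    r_hat = r / np.linalg.norm(r)          # unit-norm direction
    return x - np.outer(x @ r_hat, r_hat)  # x - (x . r_hat) r_hat, row by row

def forward(x, weights, r=None, layers_to_ablate=()):
    """Toy residual stack; ablates r from the output of the chosen layers."""
    for i, W in enumerate(weights):
        x = x + np.tanh(x @ W)             # stand-in for a transformer block
        if r is not None and i in layers_to_ablate:
            x = ablate_direction(x, r)
    return x

rng = np.random.default_rng(0)
d_model, n_layers = 16, 6
weights = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]
r = rng.normal(size=d_model)               # stand-in for the refusal direction
r_hat = r / np.linalg.norm(r)
acts = rng.normal(size=(5, d_model))       # 5 token positions

out_all = forward(acts, weights, r, layers_to_ablate=range(n_layers))
out_some = forward(acts, weights, r, layers_to_ablate={2, 3})  # arbitrary subset
out_none = forward(acts, weights)

print("max |projection|, all layers:", np.abs(out_all @ r_hat).max())
print("max |projection|, layers 2-3:", np.abs(out_some @ r_hat).max())
print("max |projection|, no ablation:", np.abs(out_none @ r_hat).max())
```

In this toy setup, ablating after every layer leaves no component along $\hat{r}$ in the final activations, while ablating at only a few layers lets later layers write to that direction again. Whether a small set of layers is enough in a real model is an empirical question, which is what the results mentioned in point 8 are about.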