Andy Arditi comments on Refusal in LLMs is mediated by a single direction

Andy Arditi 1 May 2024 22:22 UTC
1 point
0
1. Not sure if it’s new, although I haven’t seen it used like this before. I think of the weight orthogonalization as just a nice trick to implement the ablation directly in the weights. It’s mathematically equivalent, and the conceptual leap from inference-time ablation to weight orthogonalization is not a big one.
2. I think it’s a good tool for analysis of features. There are some examples of this in sections 5 and 6 of Belrose et al. 2023 - they do concept erasure for the concept “gender,” and for the concept “part-of-speech tag.”

My rough mental model is as follows (I don’t really know if it’s right, but it’s how I’m thinking about things):
- Some features seem continuous, and for these features steering in the positive and negative directions work well.
  - For example, the “sentiment” direction. Sentiment can sort of take on continuous values, e.g. −4 (very bad), −1 (slightly bad), 3 (good), 7 (extremely good). Steering in both directions works well—steering in the negative direction causes negative sentiment behavior, and in the positive causes positive sentiment behavior.
- Some features seem binary, and for these feature steering in the positive direction makes sense (turn the feature on), but ablation makes more sense than negative steering (turn the feature off).
  - For example, the refusal direction, as discussed in this post.
So yeah, when studying a new direction/feature, I think ablation should definitely be one of the things to try.