Aaron_Scher comments on Refusal in LLMs is mediated by a single direction

Aaron_Scher 29 Apr 2024 17:55 UTC
2 points
These might be dumb questions; I’m struggling to focus today and my linear algebra is rusty.
1. Is the observation that ‘you can do feature ablation via weight orthogonalization’ a new one?
2. It seems to me like this (feature ablation via weight orthogonalization) is a pretty powerful tool which could be applied to any linearly represented feature. It could be useful for modulating those features, and as such is another way to do ablations to validate a feature (part of the ‘how do we know we’re not fooling ourselves about our results’ toolkit). Does this seem right? Or does it not actually add much?
- Andy Arditi 1 May 2024 22:22 UTC
  1 point
  1. Not sure if it’s new, although I haven’t seen it used like this before. I think of the weight orthogonalization as just a nice trick to implement the ablation directly in the weights. It’s mathematically equivalent, and the conceptual leap from inference-time ablation to weight orthogonalization is not a big one.
  2. I think it’s a good tool for analysis of features. There are some examples of this in sections 5 and 6 of Belrose et al. 2023 - they do concept erasure for the concept “gender,” and for the concept “part-of-speech tag.”
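To make the "mathematically equivalent" claim in point 1 concrete, here is a minimal NumPy sketch. It assumes a single hypothetical weight matrix `W` that writes into the residual stream and a hypothetical unit feature direction `r_hat`; the shapes and random values are illustrative and not taken from the post. It checks that ablating the direction from the output at inference time gives the same result as orthogonalizing the weights once, W' = W - r_hat r_hat^T W.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_in = 8, 5  # illustrative sizes, not from the post

# Hypothetical feature direction in the residual stream, normalized to unit length.
r = rng.normal(size=d_model)
r_hat = r / np.linalg.norm(r)

# Hypothetical matrix that writes into the residual stream (output = W @ x).
W = rng.normal(size=(d_model, d_in))
x = rng.normal(size=d_in)


def ablate(v, direction):
    """Inference-time ablation: remove the component of v along a unit direction."""
    return v - direction * (direction @ v)


# Weight orthogonalization: project the direction out of W's output space once.
W_orth = W - np.outer(r_hat, r_hat) @ W

out_ablated = ablate(W @ x, r_hat)  # ablate the activation after computing it
out_orth = W_orth @ x               # use the modified weights, no runtime hook

equivalent = bool(np.allclose(out_ablated, out_orth))
residual_along_r = float(r_hat @ out_orth)  # should be ~0 up to float error

print("outputs match:", equivalent)
print("component along r_hat:", residual_along_r)
```

This only shows the equivalence for one matrix. As I understand it, getting the same effect as ablating the direction everywhere requires applying the same projection to every matrix that writes into the residual stream.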
  
  My rough mental model is as follows (I don’t really know if it’s right, but it’s how I’m thinking about things):
  - Some features seem continuous, and for these features steering in the positive and negative directions works well.
    For example, the “sentiment” direction. Sentiment can sort of take on continuous values, e.g. −4 (very bad), −1 (slightly bad), 3 (good), 7 (extremely good). Steering in both directions works well—steering in the negative direction causes negative sentiment behavior, and in the positive causes positive sentiment behavior.
  - Some features seem binary, and for these features, steering in the positive direction makes sense (turn the feature on), but ablation makes more sense than negative steering (turn the feature off).
    For example, the refusal direction, as discussed in this post.
  So yeah, when studying a new direction/feature, I think ablation should definitely be one of the things to try.
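The difference between negative steering and ablation in the mental model above can be sketched on toy vectors. This is only an illustration under assumptions: `r_hat`, the two example activations, and the steering coefficient of -2.0 are all made up. Steering shifts the projection onto the direction by a fixed amount, so the result depends on where the activation started; ablation sets the projection to zero regardless.

```python
import numpy as np


def steer(x, direction, alpha):
    """Activation addition: shift x by alpha along a unit direction."""
    return x + alpha * direction


def ablate(x, direction):
    """Directional ablation: zero out the component of x along a unit direction."""
    return x - (x @ direction) * direction


def proj(x, direction):
    """Scalar projection of x onto a unit direction."""
    return float(x @ direction)


rng = np.random.default_rng(0)
r = rng.normal(size=6)
r_hat = r / np.linalg.norm(r)  # hypothetical feature direction

# Two hypothetical activations that differ only in how strongly the feature is on.
base = ablate(rng.normal(size=6), r_hat)  # part orthogonal to r_hat
x_weak = base + 0.5 * r_hat
x_strong = base + 3.0 * r_hat

# Negative steering by a fixed amount: the outcome depends on the starting value.
steered = [proj(steer(x, r_hat, -2.0), r_hat) for x in (x_weak, x_strong)]

# Ablation: the feature is turned off in both cases.
ablated = [proj(ablate(x, r_hat), r_hat) for x in (x_weak, x_strong)]

print("after steering by -2.0:", steered)
print("after ablation:", ablated)
```

For a continuous feature like sentiment, overshooting into negative projections is the point. For a binary feature, a fixed negative shift overshoots on some inputs and undershoots on others, which is one way to read why ablation is the more natural "off" switch.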