Gianluca Calcagni comments on An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Gianluca Calcagni 9 Jul 2024 11:53 UTC
1 point
Thanks Neel, keep these coming, even if only once every few years :) You helped me clear up a lot of confusion I had about the existing techniques.
I am a huge fan of steering vectors / control vectors, and I would love to see future research showing whether they can be linearly combined to achieve multiple behaviours simultaneously (I made a post about this). I don’t think it’s just “internal work”; I think it hints that language semantics can be linearised as vector spaces (I hope to formalise this intuition mathematically).

Here is a proposal for a possible ELK solution using that approach.
- Neel Nanda 9 Jul 2024 13:17 UTC
  LW: 3 AF: 2
  Glad you liked the post!
  
  I’m also pretty interested in combining steering vectors. I think a particularly promising direction is using SAE decoder vectors for this, as SAEs are designed to find feature vectors that independently vary and can be added.
  
  I agree steering vectors are important as evidence for the linear representation hypothesis (though at this point I consider SAEs to be much superior as evidence, and think they’re more interesting to focus on).
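
The idea discussed in this thread, adding several steering vectors together to get several behaviours at once, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 4-dimensional vectors, the names `honesty` and `politeness`, and the helper functions are hypothetical and do not come from any real model or library. The two directions are chosen to be orthogonal, which is the idealised case in which each one shifts the activation independently of the other.

```python
def combine_steering_vectors(vectors, weights):
    """Return the weighted sum of equal-length steering vectors."""
    if len(vectors) != len(weights):
        raise ValueError("need exactly one weight per vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]


def steer(activation, steering_vector):
    """Add a steering vector to an activation vector, element by element."""
    return [a + s for a, s in zip(activation, steering_vector)]


def dot(u, v):
    """Dot product, used to read off how far an activation lies along a direction."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy directions (not taken from any real model).
honesty = [1.0, 0.0, 0.0, 0.0]
politeness = [0.0, 1.0, 0.0, 0.0]

# Hypothetical activation at some layer and token position.
activation = [0.5, -0.2, 0.3, 0.1]

# Combine both behaviours with different strengths, then steer once.
combined = combine_steering_vectors([honesty, politeness], [2.0, 3.0])
steered = steer(activation, combined)

# How far the activation moved along each direction.
honesty_shift = dot(steered, honesty) - dot(activation, honesty)
politeness_shift = dot(steered, politeness) - dot(activation, politeness)
print(honesty_shift, politeness_shift)
```

Because the toy directions are orthogonal unit vectors, the shift along each one equals its weight. If the directions overlapped (non-zero dot product), each weight would also move the activation along the other direction, which is one reason independently varying feature vectors are attractive for this kind of combination.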