Neel Nanda comments on An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda 9 Jul 2024 13:17 UTC
Glad you liked the post!

I’m also pretty interested in combining steering vectors. I think a particularly promising direction is using SAE decoder vectors for this, as SAEs are designed to find feature vectors that independently vary and can be added.
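To make the idea concrete, here is a minimal sketch of combining steering vectors built from SAE decoder directions. Everything in it is an illustrative assumption rather than anything from the post: the toy decoder matrix `W_dec`, the feature indices, the coefficients, and the helper names `combine_steering_vectors` and `steer` are all hypothetical, and a real setup would use a trained SAE's decoder and add the result to a model's residual stream via a hook.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def combine_steering_vectors(decoder_rows, coefficients):
    """Return the weighted sum of unit-normalized decoder directions.

    decoder_rows: one decoder vector per chosen SAE feature (hypothetical).
    coefficients: steering strength for each feature (hypothetical).
    """
    d_model = len(decoder_rows[0])
    combined = [0.0] * d_model
    for row, coeff in zip(decoder_rows, coefficients):
        unit = normalize(row)
        for i in range(d_model):
            combined[i] += coeff * unit[i]
    return combined


def steer(residual, steering_vector):
    """Add the steering vector to a single residual-stream activation."""
    return [r + s for r, s in zip(residual, steering_vector)]


random.seed(0)
d_model = 8

# Toy stand-in for an SAE decoder matrix: 4 features, each a d_model vector.
W_dec = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(4)]

# Toy stand-in for one residual-stream activation.
residual = [random.gauss(0, 1) for _ in range(d_model)]

# Combine features 1 and 3 with different strengths, then steer.
steering = combine_steering_vectors([W_dec[1], W_dec[3]], [4.0, 2.0])
steered = steer(residual, steering)
```

The combination here is plain addition, which only behaves well to the extent that the chosen features really do vary independently; that is the property the SAE training objective is meant to encourage, not something this sketch checks.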

I agree steering vectors are important as evidence for the linear representation hypothesis (though at this point I consider SAEs to be much stronger evidence, and think they’re more interesting to focus on).