Jinjin Zhao comments on An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Jinjin Zhao 15 Jul 2024 19:51 UTC
1 point
−3
AF
I am curious about your thoughts on the differences between activation patching and SAEs. Do you think they are complementary lines of research, or might there be some overarching idea that encapsulates both?

Is there any application of one that can’t be done with the other? It seems that activation patching may yield more interpretable concepts, while SAEs may yield more fundamental features. My intuition is that it may be possible for activation patching to replace SAEs in the future.
- Neel Nanda 16 Jul 2024 18:23 UTC
  2 points
  0
  Parent
  Imo they’re just completely different techniques, which aren’t really comparable. Activation patching is about understanding the difference between two activations by patching one to replace the other and seeing what happens. SAEs are a technique for decomposing an activation into interpretable pieces.
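
To make the distinction in the reply above concrete, here is a rough toy sketch in NumPy. It is not from either commenter and not how either technique is run on real models. The two-layer network, the weights `W1` and `W2`, the `patch` helper, and the hand-built dictionary `D` are all hypothetical, and the "SAE" is not trained: a real SAE learns an overcomplete dictionary with biases and a sparsity penalty. The first half compares two runs by overwriting activations; the second half splits one activation into a few feature coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: input (4) -> hidden (8, ReLU) -> scalar output.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=8)

def hidden(x):
    return np.maximum(x @ W1, 0.0)

def output_from_hidden(h):
    return float(h @ W2)

# --- Activation patching: compare TWO runs by swapping activations ---
clean_x, corrupt_x = rng.normal(size=4), rng.normal(size=4)
clean_h, corrupt_h = hidden(clean_x), hidden(corrupt_x)

def patch(neurons):
    """Run on the corrupted input, but overwrite the chosen hidden
    neurons with their values from the clean run."""
    h = corrupt_h.copy()
    h[neurons] = clean_h[neurons]
    return output_from_hidden(h)

clean_out = output_from_hidden(clean_h)
corrupt_out = output_from_hidden(corrupt_h)
patched_all = patch(np.arange(8))  # patching everything recovers the clean output
per_neuron_effect = np.array([patch([i]) - corrupt_out for i in range(8)])

# --- SAE-style decomposition: split ONE activation into feature pieces ---
# Assumption: a hand-built dictionary of 3 orthonormal directions stands in
# for a trained decoder (real SAEs are overcomplete and learned).
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
D = Q[:, :3].T  # shape (3, 8): one row per feature direction

def sae_encode(h):
    return np.maximum(h @ D.T, 0.0)  # non-negative feature coefficients

def sae_decode(f):
    return f @ D

act = 2.0 * D[0] + 0.5 * D[2]  # an activation built from features 0 and 2
features = sae_encode(act)     # approximately [2.0, 0.0, 0.5]
reconstruction = sae_decode(features)
```

In this sketch, `per_neuron_effect` says which neurons account for the difference between the two runs, while `features` says which dictionary directions a single activation is made of. The two halves answer different questions and do not depend on each other.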