Many forms of interpretability seek to explain how a network’s outputs relate to high-level concepts without referencing the network’s actual internal workings. Saliency maps are a classic example (see the sketch after the quote below), as are “build an interpretable model” techniques such as LIME.
In contrast, mechanistic interpretability tries to understand the mechanisms that make up the network’s computation. To use Chris Olah’s words:
Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.
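To make the contrast concrete, here is a minimal sketch of a gradient-based saliency map in PyTorch. The toy classifier and dummy input are hypothetical stand-ins, not from any particular paper: the point is that the result highlights which inputs the prediction is sensitive to, while saying nothing about how the network computes its answer internally.

```python
# Minimal saliency-map sketch (hypothetical toy model and dummy input).
import torch
import torch.nn as nn

# Stand-in classifier; any differentiable model would do here.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

# Dummy "image" for which we want a saliency map.
x = torch.rand(1, 1, 28, 28, requires_grad=True)

# Forward pass, then backpropagate the top class score to the input.
scores = model(x)
top_class = scores.argmax(dim=1).item()
scores[0, top_class].backward()

# Saliency = magnitude of the gradient of that score w.r.t. each pixel:
# it shows *which* inputs matter, not *how* the network uses them.
saliency = x.grad.abs().squeeze()
print(saliency.shape)  # torch.Size([28, 28])
```

A mechanistic account, by contrast, would try to say what the weights and intermediate activations are actually doing, in the reverse-engineering sense of the quote above.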
Or see this post by Daniel Filan.
Thanks! That’s a great explanation; I’ve integrated some of this wording into my MI explainer (hope that’s fine!)
Wonderful, thank you!