Many forms of interpretability seek to explain how a network’s outputs relate to high-level concepts without referencing the network’s actual internal workings. Saliency maps are a classic example (see the sketch after the quote below), as are “build an interpretable model” techniques such as LIME.
In contrast, mechanistic interpretability tries to understand the mechanisms that make up the network’s computation. To use Chris Olah’s words:
Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.
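To make the contrast concrete, here is a minimal sketch of a gradient-based saliency map in PyTorch. The toy classifier and dummy input are hypothetical stand-ins, not from any particular paper: the point is that the result highlights which inputs the prediction is sensitive to, while saying nothing about how the network computes its answer internally.

```python
# Minimal saliency-map sketch (hypothetical toy model and dummy input).
import torch
import torch.nn as nn

# Stand-in classifier; any differentiable model would do here.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

# Dummy "image" for which we want a saliency map.
x = torch.rand(1, 1, 28, 28, requires_grad=True)

# Forward pass, then backpropagate the top class score to the input.
scores = model(x)
top_class = scores.argmax(dim=1).item()
scores[0, top_class].backward()

# Saliency = magnitude of the gradient of that score w.r.t. each pixel:
# it shows *which* inputs matter, not *how* the network uses them.
saliency = x.grad.abs().squeeze()
print(saliency.shape)  # torch.Size([28, 28])
```

A mechanistic account, by contrast, would try to say what the weights and intermediate activations are actually doing, in the reverse-engineering sense of the quote above.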
Or see this post by Daniel Filan.
Thanks! That’s a great explanation; I’ve integrated some of this wording into my MI explainer (hope that’s fine!)
Wonderful, thank you!