Honestly, I also feel fairly confused by this, since mechanistic questions are just so interesting. Empirically, I've fairly rarely found academic interpretability work that interesting or useful, though I haven't read that widely (there are definitely some awesome papers from academia, as linked in the post, and some solid academics, and many more papers that contain some moderately useful insight).
To be clear, I am focusing on mechanistic interpretability—actually reverse engineering the underlying algorithms learned by a model—and I think there’s legitimate and serious work to be done in other areas that could reasonably be called interpretability.
My take would roughly be that there are a few factors, but again, I'm fairly confused by this, and "it's actually great and I'm just being a chauvinist" is also a pretty coherent explanation (and I know some alignment researchers who'd argue for the latter hypothesis):
Doing rigorous mechanistic work is just fairly hard, and doesn't really fit the ML paradigm: it doesn't really work to frame it in terms of eg benchmarks, and it's often more qualitative than quantitative. It's thus both difficult to do and hard to publish.
Lots of interpretability/explainability work treats the ground truth as things like “do human operators rate this explanation as helpful” or “does this explanation help human operators understand the model’s output better”, which feel like fairly boring metrics to me, and not very relevant to mechanistic stuff.
Lots of work focuses too much on pretty abstractions (eg syntax trees and formal grammars) and not enough on grounding itself in what's actually going on inside the model.
Mechanistic interpretability is pre-paradigmatic: there just isn't an agreed-upon way to make progress and find truth, nor an established set of techniques. This makes it both harder to do research in and harder to judge the quality of work in (and thus also harder to publish!).
I think ease of publishing is a pretty important point: even if an academic doesn't personally care about publications, their collaborators/students/supervisors often might, and there are strong career incentives to care. Eg, if a PhD student wants to work on this and it'd be much harder to publish on, a good supervisor probably should discourage it (within reason), since part of their job is looking out for their student's career interests.
Though hopefully it's getting easier to publish in nowadays! There are a few mechanistic papers submitted to ICLR.