Great resource, thanks for sharing! As somebody who’s not too deeply familiar with either mechanistic interpretability or the academic field of interpretability, I find myself confused by the fact that AI safety folks usually dismiss the large academic field of interpretability. Most academic work on ML isn’t useful for safety because safety studies different problems with different kinds of systems. But unlike worst-case robustness or inner misalignment, generating human-understandable explanations of what neural networks are doing seems like something plenty of academics would find interesting, and I would think that’s exactly what they’re trying to do. Are they just bad at generating insights? Do they look for the wrong kinds of progress, perhaps motivated by different goals? Why is the large academic field of interpretability not particularly useful for x-risk motivated AI safety?
Honestly, I also feel fairly confused by this—mechanistic questions are just so interesting. Empirically, I’ve fairly rarely found academic interpretability work that interesting or useful, though I haven’t read that widely. (There are definitely some awesome papers from academia, as linked in the post, and some solid academics, and many more papers that contain some moderately useful insight.)
To be clear, I am focusing on mechanistic interpretability—actually reverse engineering the underlying algorithms learned by a model—and I think there’s legitimate and serious work to be done in other areas that could reasonably be called interpretability.
My take would roughly be that there are a few factors—but again, I’m fairly confused by this, and “it’s actually great and I’m just being a chauvinist” is also a pretty coherent explanation (and I know some alignment researchers who’d argue for that hypothesis):
Doing rigorous mechanistic work is just fairly hard, and doesn’t really fit the ML paradigm—it doesn’t really work to frame in terms of eg benchmarks, and it’s often more qualitative than quantitative. It’s thus both difficult to do and hard to publish.
Lots of interpretability/explainability work treats the ground truth as things like “do human operators rate this explanation as helpful” or “does this explanation help human operators understand the model’s output better”, which feel like fairly boring metrics to me, and not very relevant to mechanistic stuff.
Lots of work focuses too much on pretty abstractions (eg syntax trees and formal grammars) and not enough on grounding the analysis in what’s actually going on inside the model.
Mechanistic interpretability is pre-paradigmatic—there just isn’t an agreed upon way to make progress and find truth, nor an established set of techniques. This both makes it harder to do research in, and harder to judge the quality of work in (and thus also harder to publish in!).
I think ease of publishing is a pretty important point: even if an academic doesn’t personally care about publications, their collaborators/students/supervisors often might, and there are strong career incentives to care. Eg, if a PhD student wants to work on this and it would be much harder to publish on, a good supervisor probably should discourage it (within reason), since part of their job is looking out for their student’s career interests.
Though hopefully it’s getting easier to publish in nowadays! There are a few mechanistic papers submitted to ICLR.
I mean, personally I’d say it’s the only hope we have of getting any of the reflection algorithms to work. you can’t do formal verification unless your network is at least interpretable enough that, when formal verification fails, you can work out what it is about the dataset or the learned computation that made your non-neural prover run out of compute time when you asked what the lowest margin to a misbehavior is. if the network’s innards aren’t readable enough to get an intuitive sense of why a subpath failed the verification, or what computation the network performed that timed out the verifier, it’s hard to know how to move the data around to clarify.
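(to gesture at the kind of thing i mean, here’s a toy sketch of my own: interval bound propagation on a made-up two-layer net, nothing like a real prover. it gives a crude worst-case bound on the “margin to misbehavior” over an input box, and when the bound comes back inconclusive, the numbers alone give you no story about which part of the network blew them up; that’s the gap interpretability would have to fill.)

```python
# toy sketch, purely illustrative: interval bound propagation through a tiny
# ReLU net, asking "over this input box, what's the worst-case margin between
# the 'safe' and 'unsafe' outputs?"
import numpy as np

def interval_affine(lo, hi, W, b):
    # push an axis-aligned box through x -> Wx + b
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = W @ center + b
    r = np.abs(W) @ radius
    return c - r, c + r

def lowest_margin_bound(layers, in_lo, in_hi, safe_idx, unsafe_idx):
    lo, hi = in_lo, in_hi
    for i, (W, b) in enumerate(layers):
        lo, hi = interval_affine(lo, hi, W, b)
        if i < len(layers) - 1:  # ReLU on hidden layers only
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    # pessimistic margin: safe output at its lowest minus unsafe output at its highest
    return lo[safe_idx] - hi[unsafe_idx]

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 4)), np.zeros(8)),
          (rng.normal(size=(2, 8)), np.zeros(2))]
margin = lowest_margin_bound(layers, -np.ones(4), np.ones(4), safe_idx=0, unsafe_idx=1)
print(margin)  # <= 0 just means "inconclusive", and nothing here tells you *why*
```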
of course, this doesn’t help that much if it turns out that a strong planner can amplify a very slight value misalignment quickly, as the miri folks expect; afaict, miri is worried that the process of learning (their word is “self-improving”) can speed up a huge amount once the network can make use of the full self-rewriting possibility of its substrate and properly understands the information geometry of program updates (ie, afaict, they expect significant generalizing improvement of the architecture or learning rule or such things once it’s strong enough to become a strong quine as an incidental step of doing its core task).
and so interpretability would be expected to be made useless by the ai breaking into your tensorflow to edit its own compute graph, or into your pytorch to edit its own matmul invocation order, or something. presumably that doesn’t happen at the expected level until you have an ai strong enough to significantly exceed the generalization performance of current architecture search incidentally, without being aimed at that, since the ais they’re imagining wouldn’t have been trained on that specifically the way eg alphatensor was narrowly aimed at matmul itself.
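(for intuition on that, here’s a trivial pytorch version of the “editing its own matmul invocation” worry, again just my own illustration: a forward hook swaps out what the module computes while the weights you might have reverse engineered stay untouched. an actual self-rewriting system would presumably be far less polite about it.)

```python
# toy sketch, purely illustrative: a forward hook rewrites what the module
# computes while every weight stays the same, so an analysis that only read
# the static weights now describes a computation that no longer runs.
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
x = torch.randn(1, 4)
before = model(x)

def hijack(module, inputs, output):
    # arbitrary rewrite of the output; no parameter is touched
    return output.flip(-1) * 2

handle = model.register_forward_hook(hijack)
after = model(x)

print(torch.allclose(before, after))  # False: same weights, different computation
handle.remove()
```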
wow this really got me on a train of thinking, I’m going to post more rambling to my shortform.