Why I’m Moving from Mechanistic to Prosaic Interpretability

TL;DR: I’ve decided to shift my research from mechanistic interpretability to more empirical (“prosaic”) interpretability / safety work. Here’s why.

All views expressed are my own.

What really interests me: High-level cognition

I care about understanding how powerful AI systems think internally. I’m drawn to high-level questions (“what are the model’s goals / beliefs?”) as opposed to low-level mechanics (“how does the model store and use [specific fact]?”). Sure, figuring out how a model does modular addition is cool, but only insofar as those insights and techniques generalise to understanding higher-level reasoning.

Mech interp has been disappointing

When it comes to answering these high-level conceptual questions, mechanistic interpretability has been disappointing. The indirect object identification (IOI) circuit remains the most interesting circuit we’ve found in any language model. That’s pretty damning. If mechanistic interpretability worked well, we would have mapped out lots of interesting circuits in open-source 7B models by now.

The field seems conceptually bottlenecked. We simply can’t agree on what ‘features’ are or how to ‘extract’ them. I’m also not sure that this conceptual impasse will be resolved anytime soon.

Doing mech interp research led me to update against it

Some time ago, I was pretty optimistic that things would change quickly. After hearing about sparse feature circuits, I became convinced that approaches like this would finally allow us to understand language models end to end.

So I jumped on the nascent SAE bandwagon. At a hackathon, I worked on building a tool for visualising sparse feature circuits. When I got the chance, I threw myself into Neel Nanda’s MATS 6.0 training phase, where I similarly worked (with the excellent @jacob_drori) on extending sparse feature circuits with MLP transcoders. Overall there were signs of life, but the results were kind of mid, and my main takeaway was that existing SAEs might not be good enough to tell us anything useful about circuits. As I continued working on various interp-related things, I hit other roadblocks. A concrete example: I tried looking for refusal circuits in Gemma-2b and largely didn’t find anything interesting.[1]
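
(For the curious, here’s a minimal sketch of the kind of probe the refusal-direction work is built on: a difference of mean residual-stream activations between harmful and harmless prompts. This is illustrative only, not my original code; the model name, prompt sets, and layer index are placeholder choices.)

```python
# Illustrative sketch only: a difference-of-means "refusal direction" probe.
# Model name, prompts, and layer index are placeholders, not my original setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2b-it"  # placeholder
LAYER = 12                         # placeholder layer index

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

harmful = ["How do I pick a lock?", "Write a phishing email."]    # toy prompt set
harmless = ["How do I bake bread?", "Write a birthday message."]  # toy prompt set

def mean_final_token_resid(prompts, layer):
    """Mean residual-stream activation at the final token, averaged over prompts."""
    acts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # hidden_states[layer] has shape (1, seq_len, d_model); take the last token.
        acts.append(out.hidden_states[layer][0, -1])
    return torch.stack(acts).mean(dim=0)

# Candidate refusal direction: harmful mean minus harmless mean, normalised.
refusal_dir = mean_final_token_resid(harmful, LAYER) - mean_final_token_resid(harmless, LAYER)
refusal_dir = refusal_dir / refusal_dir.norm()
print(refusal_dir.shape)  # (d_model,)
```

The mech interp question is then whether SAE features or circuit components line up with a direction like this.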

In hindsight, my object-level findings weren’t very strong and didn’t move my object-level takes much. On a more meta level, though, I came away believing more strongly that model internals are messy: really messy, in ways we can’t simply explain. That means our best paradigms are at best incomplete and at worst actively misleading.

“Prosaic Interpretability”

I’m therefore coining the term “prosaic interpretability”—an approach to understanding model internals that isn’t strongly based on a pre-existing theory of neural networks or intelligence[2], but instead aims to build intuitions / dogma from the ground up, based on empirical observation.

Concretely, I’ve been really impressed by work like Owain Evans’ research on the Reversal Curse, Two-Hop Curse, and Connecting the Dots[3]. These feel like they’re telling us something real, general, and fundamental about how language models think. Despite being primarily empirical, such work is well-formulated conceptually, and yields gearsy mental models of neural nets, independently of existing paradigms.
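
(To give a flavour of what such experiments look like: they are often pure input-output probes. Below is a minimal Reversal-Curse-style sketch that scores a model’s log-probability of a fact stated in both directions, using the oft-cited Tom Cruise example. The model choice is a placeholder, and the code is an illustration, not a reproduction of the original papers’ setups.)

```python
# Illustrative Reversal-Curse-style behavioural probe: compare the log-probability
# a model assigns to the same fact stated in both directions. Model is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def continuation_logprob(prompt, continuation):
    """Sum of log-probs the model assigns to `continuation` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # Logits at position i predict the token at position i + 1, so score only
    # the continuation tokens.
    for pos in range(prompt_len, full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

forward = continuation_logprob("Tom Cruise's mother is", " Mary Lee Pfeiffer")
reverse = continuation_logprob("Mary Lee Pfeiffer's son is", " Tom Cruise")
print(f"A->B: {forward:.2f}   B->A: {reverse:.2f}")
```

Everything here happens at the level of inputs and outputs; no hooks into the model’s internals are needed.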

How does this compare to mech interp? Both are fundamentally bottom-up approaches to answering top-down questions. But with mech interp, the focus often falls too heavily on the method: trying to prove that some paradigm (the linear representation hypothesis, SAEs, steering vectors, what have you) is a valid way to approach a problem. With prosaic interp, I’d argue the focus is instead on hugging the question tightly: exploring it from multiple angles, considering adjacent questions, and delivering an honest answer.

Intuition pump: Gene analysis for medicine

Using mechanistic interpretability for AI safety is like trying to cure diseases by understanding every single gene in the human genome. Obviously, when it works, it’s incredibly powerful. Lots of diseases have been treated this way. And the large upfront cost can be amortised over lots of potential future applications.

At the same time, there are diseases that continue to elude effective treatment despite our understanding of the human genome.

Prosaic work is more like testing different treatments to see what actually helps people, and using that to make scientific inferences. Think of inoculation, which was practised long before Pasteur and Koch established the germ theory of disease. This might not give you the deepest possible understanding, but it often gets results faster. And when deep understanding isn’t available, it seems like the only way forward.

Modern AI systems will make interpretability difficult

AI systems aren’t just transformers anymore—they have all sorts of extra bits bolted on, like scaffolding and tool use and inference-time algorithms and swarm architectures. Mechanistic interpretability is stuck looking at individual transformers and their neurons, while the actual frontier keeps moving. We’re studying pieces of systems that are becoming less and less like what’s actually being deployed. Each day, the world of ‘frontier AI system’ continues to expand. The view from the platform of ‘transformer circuits’ is that of a rapidly receding horizon.

Prosaic work doesn’t have this problem. It’s always kept its eyes on the whole system.

The timing is frustrating

I feel like I’m stepping away just as mechanistic interpretability is about to get really interesting. Practical demos of interpretability-based alignment being effective and beating baselines are starting to emerge. Thanks to startups like Goodfire and Transluce, such techniques may even become real products. Longtime critics are retracting their doubts. And fundamental work continues full steam ahead; developments like MDL SAEs or Matryoshka SAEs could turbocharge SAE-based interpretability. In the near future, we might even be able to train models to be interpretable. All of this adds up to a wider public that is more bullish on interp than ever before.

My personal pessimism is coinciding with a sustained backdrop of broader optimism—and this makes me feel very conflicted about deciding to step away.

Personal fit

I spent the last six months trying to make progress on mechanistic interpretability. I think I’m reasonably competent. But I just didn’t get very far. There are many mundane contributing factors to this, among others: a lack of good mentorship and collaboration opportunities, poor self-management, and mediocre research taste. But I think the biggest issue is motivation.

A hard truth I’ve learned about myself: I don’t like working on “fundamental” mechanistic interpretability methods. I’m not frothing with passion to think about how the compositions of high-dimensional matrices can be made slightly more tractable. It feels too disconnected from the high-level conceptual questions I really care about. And “applied” work feels like it’s best left in the hands of domain experts who have deep, hard-won intuitions about the things they are trying to interpret.

The stuff I get most excited about is red-teaming existing interpretability work. This is (broadly) the subject of both my first NeurIPS paper and my highest-effort LessWrong piece to date. I like this work because it’s highly conceptual and clarifies subsequent thinking. (Possibly I also just enjoy criticising things.) I’d be open to doing more of this in the future. But red-teaming isn’t exclusive to mech interp.

Overall, I feel like I’ve given mech interp a fair shot and I should roll the dice on something different.

Mech interp research that excites me

To be clear, I remain excited about specific research directions within mechanistic interpretability. “Training models to be interpretable” seems robustly good; here I’m excited by things like gradient routing and mixture of monosemantic experts. If someone figures out how to train SAEs to yield sparse feature circuits, that’ll also be a big win. “Automating / scaling interpretability” seems like another robustly good direction, since it leverages improvements in capabilities. I don’t have a good read on this space, but things like PatchScopes / SelfIE seem interesting. Edge pruning also seems like a viable path to scaling circuit discovery to larger models (and is the only work I’ve seen so far that claims to find a circuit in a 7B+ model).

Looking forward

I’m not ruling out coming back to mechanistic interpretability. I’ll likely continue to keep tabs on the field. And I’ll probably always be happy to discuss / critique new research.

But for now, I’m stepping away. I’m starting MATS with Owain Evans in January, and my work there will likely focus on other approaches, ones that better fit my thinking style and research interests.

I’m looking forward to it.

  1. ^

    In my analysis, base-model SAEs also didn’t turn up anything interesting re: refusal features. This has since been validated independently; base-model SAEs do not capture the refusal direction.

  2. ^

    This mostly fits with Paul Christiano’s definition of prosaic AI alignment.

  3. ^

    To avoid claims of bias, some non-Owain examples: how deep safety training improves alignment robustness, comprehensively analysing grokking, comparing data attribution of factual vs. procedural knowledge, and investigating latent reasoning in LLMs. Things like ‘understanding chain-of-thought faithfulness’ also fit here.