Reasons for my pessimism about mechanistic interpretability.
Epistemic status: I’ve noticed most AI safety folks seem more optimistic about mechanistic interpretability than I am. This is just a quick list of reasons for my pessimism. Note that I don’t have much direct experience with mech interp, and this is more of a rough brain dump than a well-thought-out take.
Interpretability just seems harder than people expect
For example, GDM recently decided to deprioritize SAEs. Something like a year ago, I think many people believed SAEs were “the solution” that would make mech interp easy. Pivots are normal and should be expected, but in short timelines we can’t really afford many of them.
So far, mech interp results come with an “exists” quantifier. We might need an “all” quantifier for safety.
What we have: “This is a statement about horses, and you can see that this horse feature here is active”
What we need: “Here is how we know whether a statement is about horses or not”
Note that these two can be pretty far apart; consider e.g. “we know this substance causes cancer” vs “we can tell for an arbitrary substance whether it causes cancer or not”.
No specific plans for how mech interp will help
Or maybe there are some and I don’t know them?
Anyway, I feel like people often say something like “If we find the deception feature, we’ll know whether models are lying to us, therefore solving deceptive alignment”. This makes sense, but how will we know whether we’ve really found the deception feature?
I think this is related to the previous point (about exists/all quantifiers): we don’t really know how to build mech interp tools that come with guarantees, so it’s hard to imagine what such a solution would look like.
Future architectures might make interpretability harder
I think ASI probably won’t be a simple non-recurrent transformer. No strong justification here—just the very rough “It’s unlikely we found something close to the optimal architecture so early”. This leads to two problems:
Our established methods might no longer work, and we might have too little time to develop new ones
The new architecture will likely be more complex, and thus mech interp might get harder
Is there a good reason to believe interpretability of future systems will be possible?
The fact that things like steering vectors, linear probes, or SAEs with some reasonable width somewhat work is a nice feature of current systems. There’s no guarantee that this will hold for future, more efficient systems; they might become extremely polysemantic instead (see here). This isn’t inevitable: maybe an optimal network learns to operate on something like natural abstractions, or maybe we’ll favor interpretable architectures over slightly-more-efficient-but-uninterpretable ones.
----
That’s it. Critical comments are very welcome.
Logprobs returned by the OpenAI API are rounded.
This shouldn’t matter for most use cases. But it’s not documented, and if I had known about it yesterday, it would have saved me some time spent looking for bugs in my code that led to weird patterns in my plots. I also couldn’t find any mention of this on the internet.
Note that o3 says this is probably because of quantization.
Specific example. Let’s say we have some prompt and the next token has the following probabilities:
These probabilities were calculated from the following logprobs, in the same order:
No clear pattern here, and they don’t look like rounded numbers. But if you subtract the highest logprob from all logprobs on the list you get:
And after rounding that to 6 decimal places the pattern becomes clear:
So the logprob resolution is 1/16 (i.e., the differences between logprobs are multiples of 0.0625).
(I tried that with 4.1 and 4.1-mini via the chat completions API)
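If you want to check this yourself, here’s a rough sketch of how to do it (not the exact code I used; the prompt and model name are just placeholders, and it assumes the openai Python SDK with an OPENAI_API_KEY in the environment). It fetches the top next-token logprobs and checks whether their differences from the highest logprob land on a 1/16 grid.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4.1-mini",  # any chat model that supports logprobs
    messages=[{"role": "user", "content": "The capital of France is"}],  # arbitrary prompt
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,
)

# Top logprobs of the first (and only) generated token.
top = response.choices[0].logprobs.content[0].top_logprobs
logprobs = [item.logprob for item in top]

best = max(logprobs)
for lp in logprobs:
    diff = lp - best
    # Raw logprobs look irregular, but the differences from the highest one
    # should land (up to float noise) on multiples of 1/16 = 0.0625.
    print(f"{lp:.6f}   diff={round(diff, 6)}   diff*16={round(diff * 16, 6)}")

If the rounding is there, the last column should be (almost exactly) whole numbers.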