Thank you for the great response, and the (undeserved) praise of my criticism. I think it’s really good that you’re embracing the slightly unorthodox positions of sticking to ambitious convictions and acknowledging that this is unorthodox. I also really like your (a)-(d) (and agree that many of the adherents of the fields you list would benefit from similar lines of thinking).
I think we largely agree, and much of our disagreement probably boils down to where we draw the boundary between “mechanistic interpretability” and “other”. In particular, I fully agree with the first zoom level in your post, and with the causal structure of much of the rest of the diagram—in particular, I like your notions of alignment robustness and mechanism distinction (the latter of which I think is original to ARC) and I think they may be central in a good alignment scenario. I also think that some notion of LPE should be present. I have some reservations about ELK as ARC envisions it (also of the “too much backchaining” variety), but think that the first-order insights there are valuable.
I think the core cruxes we have are:
You write “Mechanistic interpretability has so far yielded very little in the way of beating baselines at downstream tasks”. If I understand this correctly, you’re saying it hasn’t yet led to engineering improvements, either in capabilities or in “prosaic alignment” (at least compared to baselines like RLHF or “more compute”).
While I agree with this, I think that this isn’t the right metric to apply. Indeed, if you applied this metric, most science would not count as progress: Darwin wouldn’t get credit until his ideas were used to breed better crops, and Einstein’s relativity would count as unproductive until the A-bomb (and the theory-application gap is much longer if you look at early advances in math and physics). Rather, I think the question to ask is whether mechinterp (writ large, and in particular including a lot of people working in deep learning with no contact with safety) has made progress in understanding the internal functioning of AI, or made nontrivially principled and falsifiable predictions about how it works. Here we would probably agree that the answer is pretty unambiguously “yes”. We have strong evidence that interesting semantic features exist in superposition (whether or not this is the way that the internal mechanisms use them). We understand the rough shape of some low-level circuits that do arithmetic and copying, and have rough ideas of the shapes of some high-level mechanisms (e.g. “function vectors”). To my eyes, this should count as progress in a very new science, and if I correctly understood your claim to be that you need to “beat black-box methods at useful tasks” to count as progress, I think this is too demanding.
I think that I’m onboard with you on your desideratum #1 that theories should be “primarily mathematical” – in the sense that I think our tastes for rigor and principled theoretical science are largely aligned (and we both agree that we need good and somewhat fundamental theoretical principles to avoid misalignment). But math isn’t magic. In order to get a good mathematical tool for a real-world context, you need to make sure that you have correctly specified the context where it is to be applied, and more generally that you’ve found the “right formal context” for the math. This makes me want to be careful about context before moving on to your insight #2 of trying to guess a specific information-theoretic criterion for how to formalize “an interpretation”. Math is a dance, not a hammer: if a particular application of math isn’t working, it’s more likely that your context is wrong and you need to retarget and work outwards from simple examples, rather than try harder and route around contradictions. If you look at even a very mathy area of science, I would claim that most progress did not come from trying to make a very ambitious theoretical picture work and introducing epicycles in a “builder-breaker” fashion to get around roadblocks. Take the most mathematically heavy field with real-life applications: QFT and SFT (quantum and statistical field theory), which use deep algebraic and topological insights and today are unquestionably useful in computer chips and the like. Their origin lies in physicists observing “universality” in some physical systems, which led Landau and others to work out that a special (though quite large and perturbation-invariant) class of statistical systems can be coarse-grained in a way that reproduces these observed behaviors; this in turn led to renormalization, modern QFT, and the like.
If Landau’s generation had instead tried to work really hard on mathematically analyzing general magnet-like systems without working up from applications and real-world systems, they’d have ended up in roughly the same place as Stephen Wolfram, making overly ambitious claims about automata. The importance of looking for good theory-context fit is the main reason I would like to see more back-and-forth between more “boots-on-the-ground” interpretability theorists and more theoretical agendas like ARC and Agent Foundations. I’m optimistic that ARC’s mathematical agenda will eventually start iterating on carefully thinking about context and theory-context fit, but I think that some of the agenda I saw had the suboptimal “use math as a hammer” shape. I might be misunderstanding here, and would welcome corrections.
More specifically about “stories”: I agree with you that we are unlikely to be able to tell an easy-to-understand story about the internal workings of AIs (and in particular, I am very onboard with your first-level zoom of scalable alignment). I agree that the ultimate form of the thing we’re both gesturing at in the guise of “interpretability” will be some complicated, fractally recursive formalism using a language we probably don’t currently possess. But I think this is sort of true in a lot of other science. Better understanding leads to formulas, ideas and tools with a recursive complexity that humanity wouldn’t have guessed at before discovering them (again, QFT/SFT is an example). I’m not saying that this means “understanding AI will have the same type signature as QFT / as another science”. But I am saying that the thing it will look like will be some complicated novel shape that isn’t either modern interp or any currently-accessible guess at its final form. And indeed, if it does turn out to take the shape of something that we can guess today – for example if heuristic arguments or SAEs turn out to be a shot in the right direction – I would guess that the best route towards discovering this is to build up a pluralistic collection of ideas that both iterate on creating more elegant/more principled mathematical ideas and iterate on understanding progressively more interesting pieces of progressively more general ML models, in some class that expands outward from toy and real-world models. The history of math does also include examples of more “hammer”-like people (e.g. Wiles and Perelman), so making this bet isn’t necessarily bad, and my criticism here should not be taken too prescriptively.
In particular, I think your (a)-(d) are once again excellent guardrails against dangerous rabbitholes or communication gaps, and the only thing I can recommend somewhat confidently is to keep the ability to get interesting results about toy systems as a desideratum when building up the ambitious ideas.
Going a bit meta, I should flag an important intuition that we likely diverge on. I think that when some people defend using relatively formal math or philosophy to do alignment, they are going off of the following intuition:
- If we restrict to real-world systems, we will be incorporating assumptions about the model class.
- If we assume these assumptions continue to hold for future systems by default, we are assuming some restrictive property remains true in complicated systems despite possible pressure to train against it to avoid detection, or more neutral pressures to learn new and more complex behaviors which break this property.
- Alternatively, if we try to impose this assumption externally, we will be restricting ourselves to a weaker, “understandable” class of algorithms that will be quickly outcompeted by more generic AI.
The thing I want to point out about this picture is that it models the assumption as closed, i.e., as some exact requirement, like a parameter being equal to zero. However, many of the most interesting assumptions in physics (including the one that made QFT go brrr, i.e., renormalizability) are open: they are somewhat subtle assumptions that are perturbation-invariant and can’t be trained out (though they can be destroyed – in a clearly noticeable way – through new architectures or significant changes in complexity). In fact, there’s a core idea in physical theory, which I learned from some lecture notes of Ludvig Faddeev, that you can trace the development of physics as increasingly incorporating systems with more degrees of freedom and introducing perturbations to a physical system, starting with (essentially) classical fluid mechanics and tracing out through quantum mechanics → QFT, but always making sure you’re considering a class of systems that are “not too far” from more classical limits. The insight here is that just including more and more freedom and shifting in the directions of this freedom doesn’t land you in the maximal-complexity picture: rather, it lands you in an interesting picture that provably (for sufficiently small perturbations) allows for an interesting amount of complexity with excellent simplifications and coarse-grainings, and deep math.
Phrased less poetically, I’m making a distinction between something being robust and making no assumptions. When thinking mathematically about alignment, what we need is the former. In particular, I predict that if we study systems in the vicinity of realistic (or possibly even toy) systems, even counting on some amount of misalignment pressure, alien complexity, and so on, the pure math we get will be very different – and indeed, I think much more elegant – than if we impose no assumptions at all. I think that someone with this intuition can still be quite pessimistic, can ask for very high levels of mathematical formalism, but will still expect a very high amount of insight and progress from interacting with real-world systems.
We have strong evidence that interesting semantic features exist in superposition
I think a more accurate statement would be “We have strong evidence that neurons don’t do a single ‘thing’ (either in the human ontology or in any other natural ontology)” combined with “We have strong evidence that the residual stream represents more ‘things’ than it has dimensions”.
Aren’t both of these what people would (and did) predict without needing to look at models at all?[1] As in both of these are the null hypothesis in a certain sense. It would be kinda specific if neurons did do a single thing and we can rule out 1 “thing” per dimension in the residual stream via just noting that Transformers work at all.
I think there are more detailed models of a more specific thing called “superposition” within toy models, but I don’t think we have strong evidence of any very specific claim about larger AIs.
(SAE research has shown that SAEs often find directions which seem to at least roughly correspond to concepts that humans have and which can be useful for some methods, but I don’t think we can make a much stronger claim at this time.)
In fact, I think that mech interp research was where the hypothesis “maybe neurons represent a single thing and we can understand neurons quite well (mostly) in isolation” was raised. And this hypothesis seems to look worse than a more default guess about NNs being hard to understand and there not being an easy way to decompose them into parts for analysis.
I basically agree with you. But I think we have some nontrivial information, given enough caveats.
I think there are four hypotheses:
1a. Neurons do >1 thing (neuron polysemanticity)
1b. Insofar as we can find interesting atomic semantic features, they have >1 neuron (feature polysemanticity)
2a. Features are sparse, i.e., (insofar as there exist interesting atomic semantic features), most have significantly <1/2 probability of being “on” for any input
2b. Features are superpositional, i.e., (insofar as there exist interesting atomic semantic features), there are significantly more than dimension-many of them at any layer.
I think 1a/b and 2a/b are different in subtle ways, but most people would agree that in a rough directional yes/no sense, 1a<=>1b and 2a<=>2b (note that 2a=>2b requires some caveats—but at the limit of lots of training in a complex problem with #training examples >= #parameters, if you have sparsity of features, it simply is inefficient to not have some form of superposition). I also agree that 1a/1b are a natural thing to posit a priori, and in fact it’s much more surprising to me as someone coming from math that the “1 neuron:1 feature” hypothesis has any directional validity at all (i.e., sometimes interesting features are quite sparse in the neuron basis), rather than that anything that looks like a linear feature is polysemantic.
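To make the 2a/2b distinction concrete, here is a minimal numpy sketch of superposition in the toy-model sense. Everything in it is made up for illustration (the dimensions, sparsity level, and random feature directions are not taken from any real network); it only shows that many more sparse feature directions than dimensions can coexist with low interference and still be decoded:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_feat, k = 128, 512, 4  # 512 "features" in a 128-dim space; 4 active at a time

# Random unit directions standing in for feature vectors (hypothesis 2b:
# more features than dimensions).
F = rng.standard_normal((n_feat, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# Interference between feature directions is nonzero but small.
G = F @ F.T
off = np.abs(G - np.eye(n_feat))
print("max interference:", off.max(), "mean:", off.mean())

# A k-sparse activation (hypothesis 2a) is still decodable by dot products.
hits, trials = 0, 200
for _ in range(trials):
    active = rng.choice(n_feat, size=k, replace=False)
    x = F[active].sum(axis=0)           # superposed representation
    scores = F @ x
    decoded = np.argsort(scores)[-k:]   # top-k scoring features
    hits += len(set(active) & set(decoded))
recovery = hits / (k * trials)
print("recovery rate:", recovery)
```

The point is only directional: given sparsity (2a), far more near-orthogonal directions than dimensions fit comfortably (2b); nothing here says real networks actually organize themselves this way.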
Now to caveat my statement: I don’t think that neural nets are fully explained by a bunch of linear features, much less by a bunch of linear features in superposition. In fact, I’m not even close to 100% on superposition existing at all in any truly “atomic” decomposition of computation. But at the same time we can clearly find semantic features which have explanatory power (in the same way that we can find pathways in biology, even if they don’t correspond to any structures on the fundamental, in this case cellular, level).
And when I say that “interesting semantic features exist in superposition”, what I really mean is that we have evidence for hypothesis 2a [edited, originally said 2b, which is a typo]. Namely, when we’re looking for unsupervised ways to get such features, it turns out that enforcing sparsity (and doing an SAE) gives better interp scores than doing PCA. I think this is pretty strong evidence!
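As a hedged illustration of why the PCA comparison is informative (this is a numpy-only toy, not an SAE training run; the sizes and thresholds are invented): when there are more-than-dimension-many sparse ground-truth features, PCA can return at most d directions, chosen for variance rather than sparsity, so its components tend to mix many features rather than align with any single one:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_feat, k, n_samples = 128, 512, 4, 5000

# Ground-truth feature directions: 512 sparse "features" in a 128-dim space.
F = rng.standard_normal((n_feat, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# Each sample activates k features with random positive coefficients.
S = np.zeros((n_samples, n_feat))
for i in range(n_samples):
    idx = rng.choice(n_feat, size=k, replace=False)
    S[i, idx] = rng.uniform(0.5, 1.5, size=k)
X = S @ F

# PCA directions of the data (at most d of them exist).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Best cosine alignment of each top PCA component with any single
# ground-truth feature; 1.0 would mean the component *is* a feature.
align = np.abs(Vt[:32] @ F.T).max(axis=1)
print("mean alignment of top-32 PCA components:", align.mean())
```

A dictionary trained with a sparsity penalty is at least allowed to have 512 directions; PCA structurally is not, which is one reason sparse methods can align better with this kind of ground truth.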
it turns out that enforcing sparsity (and doing an SAE) gives better interp scores than doing PCA
It’s not clear to me this is true exactly. As in, suppose I want to explain as much of what a transformer is doing as possible within some amount of time. Would I be better off looking at PCA features or SAE features?
Yes, most/many SAE features are easier to understand than PCA features, but each SAE feature (which is actually sparse) is only a tiny, tiny fraction of what the model is doing. So, it might be that you’d get better interp scores (in terms of how much of what the model is doing) with PCA.
Certainly, if we do literal “fraction of loss explained by human written explanations” both PCA and SAE recover approximately 0% of training compute.
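For concreteness, a “fraction of loss recovered” style of metric can be written down on a toy example. Everything below is a made-up stand-in (a fixed linear “model” and a rank-8 PCA reconstruction playing the role of the “explanation”), intended only to show the bookkeeping, not any particular paper’s methodology:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 16

# Toy stand-in for "the model's downstream computation": a fixed linear
# readout on a d-dim activation vector, scored by squared error.
W = rng.standard_normal(d)
acts = rng.standard_normal((n, d))
targets = acts @ W                       # so the clean loss is exactly 0

def loss(a):
    return float(np.mean((a @ W - targets) ** 2))

clean = loss(acts)                       # best case
ablated = loss(np.zeros_like(acts))      # worst case: activations zeroed

# "Explanation" = keep only a rank-8 PCA reconstruction of the activations.
mu = acts.mean(axis=0)
_, _, Vt = np.linalg.svd(acts - mu, full_matrices=False)
recon = (acts - mu) @ Vt[:8].T @ Vt[:8] + mu
patched = loss(recon)

# Fraction of the clean-vs-ablated loss gap the reconstruction recovers.
frac_recovered = (ablated - patched) / (ablated - clean)
print(round(frac_recovered, 3))
```

The claim in the discussion is that, measured end-to-end like this against a full model, both PCA- and SAE-style explanations currently recover a negligible fraction.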
I do think you can often learn very specific more interesting things with SAEs and for various applications SAEs are more useful, but in terms of some broader understanding, I don’t think SAEs clearly are “better” than PCA. (There are also various cases where PCA on some particular distribution is totally the right tool for the job.)
Certainly, I don’t think it has been shown that we can get non-negligible interp scores with SAEs.
To be clear, I do think we learn something from the fact that SAE features seem to often/mostly at least roughly correspond to some human concept, but I think the fact that there are vastly more SAE features vs PCA features does matter! (PCA was never trying to decompose into this many parts.)
Yes, I generally agree with this. I also realized that “interp score” is ambiguous (and the true end-to-end interp score is negligible, I agree), but what’s more clearly true is that SAE features tend to be more interpretable. This might be largely explained by “people tend to think of interpretable features as branches of a decision tree, which are sparsely activating”. But it was also surprising to me that the top SAE features are significantly more interpretable than the top PCA features.
So to elaborate: we get significantly more interpretable features if we enforce sparsity than if we just do more standard clustering procedures. This is nontrivial! Of course this might be saying more about our notions of “interpretable feature” and how we parse semantics; but I can certainly imagine a world where PCA gives much better results, and would have in fact by default expected this to be true for the “most important” features even if I believed in superposition.
So I’m somewhat comfortable saying that the fact that imposing sparsity works so well is telling us something. I don’t expect this to give “truly atomic” features from the network’s PoV (any more than understanding Newtonian physics tells us about the standard model), but this seems like nontrivial progress to me.
It sounds like we are not that far apart here. We’ve been doing some empirical work on toy systems to try to make the leap from mechanistic interpretability “stories” to semi-formal heuristic explanations. The max-of-k draft is an early example of this, and we have more ambitious work in progress along similar lines. I think of this work in a similar way to you: we are not trying to test empirical assumptions (in the way that some empirical work on frontier LLMs is, for example), but rather to learn from the process of putting our ideas into practice.
Somewhat off-topic, but isn’t this a non-example:
I think a more accurate statement would be “We have strong evidence that neurons don’t do a single ‘thing’ (either in the human ontology or in any other natural ontology)” combined with “We have strong evidence that the residual stream represents more ‘things’ than it has dimensions”.
I do think this was reasonably though not totally predictable ex-ante, but I agree.