We have strong evidence that interesting semantic features exist in superposition
I think a more accurate statement would be “We have strong evidence that neurons don’t do a single ‘thing’ (either in the human ontology or in any other natural ontology)” combined with “We have strong evidence that the residual stream represents more ‘things’ than it has dimensions”.
Aren’t both of these what people would (and did) predict without needing to look at models at all?[1] As in both of these are the null hypothesis in a certain sense. It would be kinda specific if neurons did do a single thing and we can rule out 1 “thing” per dimension in the residual stream via just noting that Transformers work at all.
I think there are more detailed models of a more specific thing called “superposition” within toy models, but I don’t think we have strong evidence of any very specific claim about larger AIs.
(SAE research has shown that SAEs often find directions which seem to at least roughly correspond to concepts that humans have and which can be useful for some methods, but I don’t think we can make a much stronger claim at this time.)
In fact, I think that mech interp research was where the hypothesis “maybe neurons represent a single thing and we can understand neurons quite well (mostly) in isolation” was raised. And this hypothesis seems to look worse than a more default guess about NNs being hard to understand and there not being an easy way to decompose them into parts for analysis.
I basically agree with you. But I think we have some nontrivial information, given enough caveats.
I think there are four hypotheses:
1a. Neurons do >1 thing (neuron polysemanticity)
1b. Insofar as we can find interesting atomic semantic features, they have >1 neuron (feature polysemanticity)
2a. Features are sparse, i.e., (insofar as there exist interesting atomic semantic features), most have significantly <1/2 probability of being “on” for any input
2b. Features are superpositional, i.e., (insofar as there exist interesting atomic semantic features), there are significantly more than dimension-many of them at any layer.
I think 1a/b and 2a/b are different in subtle ways, but most people would agree that in a rough directional yes/no sense, 1a<=>1b and 2a<=>2b (note that 2a=>2b requires some caveats—but at the limit of lots of training in a complex problem with #training examples >= #parameters, if you have sparsity of features, it simply is inefficient to not have some form of superposition). I also agree that 1a/1b are a natural thing to posit a priori, and in fact it’s much more surprising to me as someone coming from math that the “1 neuron:1 feature” hypothesis has any directional validity at all (i.e., sometimes interesting features are quite sparse in the neuron basis), rather than that anything that looks like a linear feature is polysemantic.
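To make the geometric content of 2b concrete, here is a minimal sketch of my own (not anything from the papers being discussed): in d dimensions you can fit far more than d directions that are all nearly orthogonal to one another, which is what lets sparse features share a layer without interfering much.

```python
import numpy as np

# Sketch: draw many more random "feature" directions than dimensions and
# check that they are all nearly orthogonal to one another.
rng = np.random.default_rng(0)
d, n_features = 256, 4096  # 16x more features than dimensions

features = rng.normal(size=(n_features, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Pairwise cosine similarities, ignoring the diagonal.
cos = features @ features.T
off_diag = np.abs(cos[~np.eye(n_features, dtype=bool)])

print(f"mean |cos| = {off_diag.mean():.3f}")  # small: roughly sqrt(2/(pi*d))
print(f"max  |cos| = {off_diag.max():.3f}")   # still well below 1
```

The point is just that near-orthogonality is cheap in high dimensions; it says nothing about whether real networks actually use their capacity this way.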
Now to caveat my statement: I don’t think that neural nets are fully explained by a bunch of linear features, much less by a bunch of linear features in superposition. In fact, I’m not even close to 100% on superposition existing at all in any truly “atomic” decomposition of computation. But at the same time we can clearly find semantic features which have explanatory power (in the same way that we can find pathways in biology, even if they don’t correspond to any structures on the fundamental, in this case cellular, level).
And when I say that “interesting semantic features exist in superposition”, what I really mean is that we have evidence for hypothesis 2a [edited, originally said 2b, which is a typo]. Namely, when we’re looking for unsupervised ways to get such features, it turns out that enforcing sparsity (and doing an SAE) gives better interp scores than doing PCA. I think this is pretty strong evidence!
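For concreteness, here is roughly the shape of that comparison in code. This is a bare-bones sketch on made-up stand-in activations, not any published SAE recipe; real runs use activations cached from an actual model, much wider dictionaries, and various training tricks (dead-feature resampling, normalization, etc.).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a batch of residual-stream activations (a real run would use
# activations cached from an actual model).
n_samples, d_model, d_dict = 4096, 64, 512
acts = torch.randn(n_samples, d_model)

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete dictionary, ReLU codes, L1 sparsity penalty."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))
        recon = self.decoder(codes)
        return recon, codes

sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # trades reconstruction quality against sparsity

for step in range(2000):
    recon, codes = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# PCA baseline: top-k principal directions of the same activations.
k = 32
centered = acts - acts.mean(dim=0)
_, _, vt = torch.linalg.svd(centered, full_matrices=False)
pca_codes = centered @ vt[:k].T

# Crude structural contrast: the SAE uses an overcomplete code pushed toward
# sparsity, while PCA gives a small dense orthogonal basis.
with torch.no_grad():
    _, codes = sae(acts)
    print("SAE: mean fraction of active features per input:",
          (codes > 0).float().mean().item())
    print("PCA: mean fraction of nonzero components per input:",
          (pca_codes.abs() > 1e-6).float().mean().item())
```

The actual “interp scores” in the claim are human (or model-assisted) judgments of how interpretable each feature’s top activating examples are, which this sketch obviously doesn’t measure; it only shows the structural difference between the two decompositions (overcomplete and sparse vs. orthogonal and dense).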
it turns out that enforcing sparsity (and doing an SAE) gives better interp scores than doing PCA
It’s not clear to me this is true exactly. As in, suppose I want to explain as much of what a transformer is doing as possible with some amount of time. Would I be better off looking at PCA features or SAE features?
Yes, most/many SAE features are easier to understand than PCA features, but each SAE feature (which is actually sparse) is only a tiny, tiny fraction of what the model is doing. So, it might be that you’d get better interp scores (in terms of how much of what the model is doing) with PCA.
Certainly, if we do literal “fraction of loss explained by human written explanations” both PCA and SAE recover approximately 0% of training compute.
I do think you can often learn very specific, more interesting things with SAEs, and for various applications SAEs are more useful, but in terms of some broader understanding, I don’t think SAEs are clearly “better” than PCA. (There are also various cases where PCA on some particular distribution is totally the right tool for the job.)
Certainly, I don’t think it has been shown that we can get non-negligible interp scores with SAEs.
To be clear, I do think we learn something from the fact that SAE features seem to often/mostly at least roughly correspond to some human concept, but I think the fact that there are vastly more SAE features vs PCA features does matter! (PCA was never trying to decompose into this many parts.)
Yes—I generally agree with this. I also realized that “interp score” is ambiguous (and the true end-to-end interp score is negligible, I agree), but what’s more clearly true is that SAE features tend to be more interpretable. This might be largely explained by “people tend to think of interpretable features as branches of a decision tree, which are sparsely activating”. But it was also surprising to me that the top SAE features are significantly more interpretable than the top PCA features.
So to elaborate: we get significantly more interpretable features if we enforce sparsity than if we just do more standard clustering procedures. This is nontrivial! Of course this might be saying more about our notions of “interpretable feature” and how we parse semantics; but I can certainly imagine a world where PCA gives much better results, and would have in fact by default expected this to be true for the “most important” features even if I believed in superposition.
So I’m somewhat comfortable saying that the fact that imposing sparsity works so well is telling us something. I don’t expect this to give “truly atomic” features from the network’s PoV (any more than understanding Newtonian physics tells us about the standard model), but this seems like nontrivial progress to me.
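A toy sanity check of my own along these lines, with sklearn’s DictionaryLearning standing in for an SAE: generate data as sparse combinations of more planted directions than dimensions, and see how well PCA versus a generic sparse dictionary learner recovers those directions. Nothing here is from the thread or from any paper; the setup and numbers are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA, DictionaryLearning

rng = np.random.default_rng(0)
d, n_true, n_samples = 32, 96, 2000  # 3x more planted features than dimensions

# Planted ground truth: an overcomplete set of unit-norm feature directions.
true_feats = rng.normal(size=(n_true, d))
true_feats /= np.linalg.norm(true_feats, axis=1, keepdims=True)

# Each sample activates only a few features (roughly 4 of 96), i.e. hypothesis 2a.
coeffs = rng.exponential(1.0, size=(n_samples, n_true))
active = rng.random((n_samples, n_true)) < 4 / n_true
data = (coeffs * active) @ true_feats + 0.01 * rng.normal(size=(n_samples, d))

def mean_best_cos(learned, truth):
    """For each planted feature, the best |cosine| with any learned direction."""
    learned = learned / (np.linalg.norm(learned, axis=1, keepdims=True) + 1e-12)
    return np.abs(truth @ learned.T).max(axis=1).mean()

pca = PCA(n_components=d).fit(data)
sparse = DictionaryLearning(n_components=n_true, alpha=0.5,
                            max_iter=100, random_state=0).fit(data)

print("PCA            mean best |cos|:", round(mean_best_cos(pca.components_, true_feats), 3))
print("sparse coding  mean best |cos|:", round(mean_best_cos(sparse.components_, true_feats), 3))
```

This is about as toy as it gets, and the planted-dictionary setup assumes the conclusion in a sense (the data really is built from sparse features); the point is only that when that assumption holds, a sparsity-seeking method can line up with the planted directions in a way PCA structurally cannot, since PCA is limited to d orthogonal components.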
Somewhat off-topic, but isn’t this a non-example:
I think a more accurate statement would be “We have strong evidence that neurons don’t do a single ‘thing’ (either in the human ontology or in any other natural ontology)” combined with “We have strong evidence that the residual stream represents more ‘things’ than it has dimensions”.
Aren’t both of these what people would (and did) predict without needing to look at models at all?[1] As in both of these are the null hypothesis in a certain sense. It would be kinda specific if neurons did do a single thing and we can rule out 1 “thing” per dimension in the residual stream via just noting that Transformers work at all.
I do think this was reasonably though not totally predictable ex-ante, but I agree.