Sorry for not noticing this earlier. I’m not very active on LessWrong/AF. In case it’s still helpful, a couple of thoughts...
Firstly, I think people shouldn’t take this graph too seriously! I made it for a talk in ~2017, I think, and even then it was intended as a vague intuition, not something I was confident in. I do occasionally gesture at it as a possible intuition, but it’s just a vague idea which may or may not be true.
I do think there’s some kind of empirical trend where, in some cases, models first become harder to understand and then easier. For example:
An MNIST model without a hidden layer (i.e., a softmax generalized linear model) is easy to understand (see the sketch below this list).
An MNIST model with a single hidden layer is very hard to understand.
Large convolutional MNIST/CIFAR models are perhaps slightly easier (you sometimes get nice convolutional filters).
AlexNet is noticeably more interpretable. You start to see quite a few human interpretable neurons.
InceptionV1 seems noticeably more interpretable still.
These are all qualitative observations / not rigorous / not systematic.
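To make the first example concrete, here’s a minimal sketch (in PyTorch, purely illustrative) of why the no-hidden-layer case is easy to understand: the model is a single 784 → 10 linear map, so each class’s weight row is just a 28×28 template you can render and look at.

```python
# Illustrative only: a no-hidden-layer MNIST classifier is one 784 -> 10
# linear map, so "interpreting" it amounts to looking at ten weight images.
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

model = nn.Sequential(
    nn.Flatten(),        # 28x28 image -> 784-dim vector
    nn.Linear(784, 10),  # one 784-dim weight row ("template") per digit class
)
# (Softmax is folded into nn.CrossEntropyLoss during training.)

# After training, interpretation is direct: reshape each class's weight row
# into a 28x28 image and inspect it.
templates = model[1].weight.detach().reshape(10, 28, 28)
fig, axes = plt.subplots(1, 10, figsize=(15, 2))
for digit, ax in enumerate(axes):
    ax.imshow(templates[digit], cmap="gray")
    ax.set_title(str(digit))
    ax.axis("off")
plt.show()
```

Once there’s even a single hidden layer, there’s no longer a direct pixel-space template per class, which is roughly where the “very hard to understand” dip begins.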
So what is going on? I think there are several hypotheses:
Confused abstractions / Modeling Limitations—Weak models can’t express the true abstractions, so they use a mixture of bad heuristics that are hard to reason about and much more complicated than the right thing. This is the original idea behind this curve.
Overfitting—Perhaps it has less to do with the models and more to do with the data. See this paper where overfitting seems to make features less interpretable. So ImageNet models may be more interpretable because the dataset is harder and there’s less extreme overfitting.
Superposition—It may be that apparent interpretability is really driven by superposition, and that for some reason the later models have less of it. For example, there are cases where larger models (with more neurons) may store fewer features in superposition.
Overfitting–Superposition Interaction—There’s reason to believe that overfitting may heavily exploit superposition (overfit features detect very specific cases and so are very sparse, which is ideal for superposition; see the toy sketch below). Thus, overfit models may be difficult to interpret not because overfit features are intrinsically uninterpretable, but because they cause superposition.
Other Interaction Effects—It may be that there are other interactions among the hypotheses above. For example, perhaps modeling limitations cause the model to represent weird heuristics, which then more heavily exploit superposition.
Idiosyncrasies of Vision—It may be that the observations which motivated this curve are idiosyncratic to vision models, or maybe just the models I was studying.
I suspect it’s a mix of quite a few of these.
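To make the sparsity point in the superposition hypotheses a bit more concrete, here’s a toy sketch in the spirit of the toy-models-of-superposition setup (purely illustrative; the sizes, sparsity level, and threshold are arbitrary): compress sparse synthetic features into fewer dimensions and count how many the model ends up representing.

```python
# Toy sketch: n_features sparse features compressed into n_dims dimensions.
# With sparse enough features, the model typically represents more than n_dims
# of them, i.e. it stores features in superposition.
import torch

n_features, n_dims = 20, 5
sparsity = 0.95  # probability that any given feature is zero on a sample

W = (0.1 * torch.randn(n_dims, n_features)).requires_grad_()
b = torch.zeros(n_features, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Synthetic sparse data: each feature is active with prob (1 - sparsity).
    active = (torch.rand(1024, n_features) < (1 - sparsity)).float()
    x = torch.rand(1024, n_features) * active
    x_hat = torch.relu(x @ W.t() @ W + b)  # compress to n_dims, then reconstruct
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Columns of W with non-trivial norm are the features the model chose to keep;
# with sparse inputs this count usually exceeds n_dims (superposition).
kept = (W.detach().norm(dim=0) > 0.5).sum().item()
print(f"{kept} features represented in {n_dims} dimensions")
```

Pushing the sparsity toward 0 tends to leave only about n_dims features represented, which is the sense in which very sparse (e.g. overfit) features are “ideal for superposition.”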
In the case of language models, I think superposition is really the driving force, and the situation is quite different from the vision case (language model features typically seem to be much sparser than vision model features).