If, due to superposition, it proves advantageous to the AI to have a single feature that kind-of does dog-head detection and kind-of does car-front detection (because dog heads and car fronts don’t show up in the training data at the same time, so it can still get perfect loss through a properly constructed dual-purpose feature like this), it’d mean that to the AI, dog heads and car fronts are “the same thing”.
I don’t think that’s true. Imagine a toy scenario of two features that run through a 1D non-linear bottleneck before being reconstructed. Assuming that with some weight settings you can get superposition, the model is able to reconstruct the features ≈perfectly as long as they don’t appear together. That means the model can still differentiate the two features; they are different in the model’s ontology.
As AIs get more capable and general, I’d expect the concepts/features they use to start more closely matching the ones humans use in many domains.
My intuition disagrees here too. Whether we will observe superposition is a function of (number of “useful” features in the data), (sparsity of said features), and something like (bottleneck size).
It’s possible that bottleneck size will never be enough to compensate for the number of features. Also it seems reasonable to me that ≈all of reality is extremely sparse in features, which presumably favors superposition.
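As a rough illustration of that dependence, here is a minimal toy-model sketch: a linear map into an m-dimensional bottleneck with a ReLU read-out, trained to reconstruct n sparse features. The architecture, hyperparameters, and the “features represented” threshold below are illustrative choices rather than anything established in this thread; the expectation is simply that the count of represented features tends to grow with sparsity, with exact numbers varying from run to run.

```python
import torch

def count_represented_features(n_feat=20, n_hidden=5, sparsity=0.9,
                               steps=5_000, seed=0):
    """Train x_hat = ReLU(x @ W.T @ W + b) to reconstruct sparse x, then count
    how many of the n_feat features get a non-negligible column in W."""
    torch.manual_seed(seed)
    W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_feat))
    b = torch.nn.Parameter(torch.zeros(n_feat))
    opt = torch.optim.Adam([W, b], lr=1e-3)
    for _ in range(steps):
        # Each feature is active (uniform in [0, 1]) with probability 1 - sparsity.
        mask = (torch.rand(1024, n_feat) > sparsity).float()
        x = torch.rand(1024, n_feat) * mask
        x_hat = torch.relu(x @ W.T @ W + b)
        loss = ((x - x_hat) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # A feature counts as "represented" if its embedding column has norm > 0.5.
    return (W.detach().norm(dim=0) > 0.5).sum().item()

for s in (0.0, 0.7, 0.99):
    k = count_represented_features(sparsity=s)
    print(f"sparsity={s}: {k} of 20 features represented in a 5-dim bottleneck")
```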
Also it seems reasonable to me that ≈all of reality is extremely sparse in features, which presumably favors superposition.
Reality is usually sparse in features, and that’s why even very small and simple intelligences can operate within it most of the time, so long as they don’t leave their narrow contexts. But the mark of a general intelligence is that it can operate even in highly out-of-distribution situations. Cars are usually driven on roads, so an intelligence could get by using a car even if its concept of car-ness were all mixed up with its concept of road-ness. But a human can plan to take a car to the moon and drive it on the dust there, and then do that. This indicates to me that a general intelligence needs to think in features that can compose to handle almost any data, not just data that usually appeared in the training distribution.
If your architecture has too many bottlenecks to allow this, I expect it will not be able to become a human-level general intelligence.
(Parts of the human brain definitely seem narrow and specialised too, of course; it’s only the general reasoning capabilities that seem to have these ultra-factorising, nigh-universally applicable concepts.)
Note also that concepts humans use can totally be written as superpositions of other concepts too; most of these other concepts apparently just aren’t very universally useful.
Reality is usually sparse in features, and that’s why even very small and simple intelligences can operate within it most of the time, so long as they don’t leave their narrow contexts.
Reality is rich in features, but sparse in features that matter to a simple organism. That’s why context matters.
I don’t think that’s true. Imagine a toy scenario of two features that run through a 1D non-linear bottleneck before being reconstructed. Assuming that with some weight settings you can get superposition, the model is able to reconstruct the features ≈perfectly as long as they don’t appear together. That means the model can still differentiate the two features; they are different in the model’s ontology.
I’m not sure I understand this example. If I have a single 1-D feature, a floating point number that goes up with the amount of dog-headedness or car-frontness in a picture, then how can the model in a later layer reconstruct whether there was a dog-head xor a car-front in the image from that floating point number, unless it has other features that effectively contain this information?
Possibly the source of our disagreement here is that you are imagining the neuron ought to be strictly monotonically increasing in activation relative to the dog-headedness of the image?
If we abandon that assumption then it is relatively clear how to encode two numbers in 1D. Let’s assume we observe two numbers X, Y: with probability p, X = 0 and Y ∼ N(0, 1), and with probability 1 − p, Y = 0 and X ∼ N(0, 1).
We now want to encode these two events in some third variable Z, such that we can perfectly reconstruct X,Y with probability ≈1.
I put the solution behind a spoiler for anyone wanting to try it on their own.
Choose some veeeery large μ ≫ 1 (much greater than the variance of the features’ normal distribution). For the first event, set Z = Y − μ. For the second event, set Z = X + μ.
The decoding works as follows:
If Z is negative, then with probability ≈1 we are in the first scenario and we can set X = 0, Y = Z + μ. Vice versa if Z is positive: set Y = 0, X = Z − μ.
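To make the construction above concrete, here is a small numerical check; the offset value, sample size, and variable names are just illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
p = 0.5        # probability of the first event (X = 0, Y ~ N(0, 1))
mu = 1e3       # the "veeeery large" offset mu from the construction above

# Sample the two mutually exclusive events.
first = rng.random(n) < p
Y = np.where(first, rng.normal(size=n), 0.0)
X = np.where(first, 0.0, rng.normal(size=n))

# Encode both numbers into a single scalar Z.
Z = np.where(first, Y - mu, X + mu)

# Decode: the sign of Z tells us (with probability ~1) which event occurred.
X_hat = np.where(Z < 0, 0.0, Z - mu)
Y_hat = np.where(Z < 0, Z + mu, 0.0)

print("max reconstruction error:",
      max(np.abs(X - X_hat).max(), np.abs(Y - Y_hat).max()))
```

Because μ is far larger than the typical feature magnitude, the sign of Z identifies the active feature with probability ≈1, and the reconstruction error is essentially floating-point noise from the large offset.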
Ah, I see. Thank you for pointing this out. Do superposition features actually seem to work like this in practice in current networks? I was not aware of this.
In any case, for a network like the one you describe I would change my claim from
it’d mean that to the AI, dog heads and car fronts are “the same thing”.
to the AI having a concept for something humans don’t have a neat short description for. So for example, if your algorithm maps X>0, Y>0 to the first case, I’d call it a feature of “presence of dog heads or car fronts, or presence of car fronts”.
I don’t think this is an inherent problem for the theory. That a single floating point number can contain a lot of information is fine, so long as you have some way to measure how much information it actually carries.
Do superposition features actually seem to work like this in practice in current networks? I was not aware of this.
I’m not aware of any work that identifies superposition in exactly this way in NNs of practical use. As Spencer notes, you can verify that it does appear in certain toy settings, though. Anthropic notes in their SoLU paper that they view their results as evidence for the superposition hypothesis (SPH) in LLMs. Imo the key part of the evidence here is that using a SoLU destroys performance, but adding another LayerNorm afterwards solves that issue. The SoLU selects strongly against superposition and the LayerNorm makes it possible again, which is some evidence that the way the LLM got to its performance was via superposition.
ETA: Ofc there could be some other mediating factor, too.
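For readers unfamiliar with it: the activation in question is SoLU(x) = x · softmax(x), with an extra LayerNorm applied after it in the variant that recovers performance. Below is a minimal sketch of that general shape, not Anthropic’s exact implementation.

```python
import torch

def solu(x: torch.Tensor) -> torch.Tensor:
    # SoLU(x) = x * softmax(x): boosts the largest pre-activations and suppresses
    # the rest, which discourages spreading one feature across many neurons.
    return x * torch.softmax(x, dim=-1)

class SoLUMLP(torch.nn.Module):
    """Sketch of a transformer-style MLP block using SoLU followed by LayerNorm."""
    def __init__(self, d_model: int, d_mlp: int):
        super().__init__()
        self.w_in = torch.nn.Linear(d_model, d_mlp)
        self.ln = torch.nn.LayerNorm(d_mlp)   # the extra LayerNorm discussed above
        self.w_out = torch.nn.Linear(d_mlp, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(self.ln(solu(self.w_in(x))))
```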
This example is meant only to illustrate how one could achieve this encoding; it’s not how an actual autoencoder would work. An actual NN might not even use superposition for the data I described, and it might need some other setup to elicit this behavior. But it sounded to me like you view superposition as nothing but the network being confused, whereas I think it can be the correct way to still be able to reconstruct the features to a reasonable degree.
Not confused, just optimised to handle data of the kind seen in training, and with limited ability to generalise beyond that, compared to human vision.
Yeah, I agree with that. But there is also a sense in which some (many?) features will be inherently sparse:
A token is either the first one of a multi-token word or it isn’t.
A word is either a noun, a verb or something else.
A word belongs to language LANG and not to any other language (or has other meanings in those other languages).
An H×W image can only contain so many objects, each of which can only contain so many sub-aspects.
I don’t know what it would mean to go “out of distribution” in any of these cases.
This means that any network that has an incentive to conserve parameter usage (however we want to define that) might want to use superposition.