[Note] Is Superposition the reason for Polysemanticity? Lessons from “The Local Interaction Basis”
Superposition is currently the dominant hypothesis to explain polysemanticity in neural networks. However, how much better does it explain the data than alternative hypotheses?
Non-neuron aligned basis. The leading alternative, as asserted by Lawrence Chan here, is that there are not a very large number of underlying features; just that these features are not represented in a neuron-aligned way, so individual neurons appear to fire on multiple distinct features.
The Local Interaction Basis explores this idea in more depth. Starting from the premise that there is a linear and interpretable basis that is not overcomplete, they propose a method to recover such a basis, which works in toy models. However, empirical results in language models fail to demonstrate that the recovered basis is indeed more interpretable.
My conclusion from this is a big downwards update on the likelihood of the “non-neuron aligned basis” in realistic domains like natural language. The real world probably just is complex enough that there are tons of distinct features which represent reality.
You’ll enjoy reading What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes (link to the paper)
Using a combination of theory and experiments, we show that incidental polysemanticity can arise due to multiple reasons including regularization and neural noise; this incidental polysemanticity occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap.
[Note] Is Superposition the reason for Polysemanticity? Lessons from “The Local Interaction Basis”
Superposition is currently the dominant hypothesis to explain polysemanticity in neural networks. However, how much better does it explain the data than alternative hypotheses?
Non-neuron aligned basis. The leading alternative, as asserted by Lawrence Chan here, is that there are not a very large number of underlying features; just that these features are not represented in a neuron-aligned way, so individual neurons appear to fire on multiple distinct features.
The Local Interaction Basis explores this idea in more depth. Starting from the premise that there is a linear and interpretable basis that is not overcomplete, they propose a method to recover such a basis, which works in toy models. However, empirical results in language models fail to demonstrate that the recovered basis is indeed more interpretable.
My conclusion from this is a big downwards update on the likelihood of the “non-neuron aligned basis” in realistic domains like natural language. The real world probably just is complex enough that there are tons of distinct features which represent reality.
You’ll enjoy reading What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes (link to the paper)