My guess is that this result is very sensitive to the design of the training dataset:
the input/output data pairs are (e_i, e_i) for i ∈ [n], where e_i ∈ ℝ^n is the i-th basis vector.
In particular, I think it is likely very sensitive to the implicit assumption that feature i and feature j never co-occur on a single input. I’d be interested to see experiments where each feature is turned on with some (not too small) probability, independently of all other features, similarly to the original toy models setting. This would result in some inputs where feature i and j are on simultaneously. My prediction would be that polysemanticity goes down very significantly (probably to zero if the probabilities are high enough and the training is done for long enough).
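As a rough illustration of the sampling scheme suggested here, the sketch below contrasts the one-hot dataset from the post with inputs where each feature is switched on independently with probability p. The function names, the sizes `n` and `batch`, and the values of `p` are assumptions chosen for illustration, not taken from the post. It only measures how often two features are on at once; it does not train a model or measure polysemanticity.

```python
import numpy as np

def sample_one_hot(n, batch, rng):
    """Dataset as described in the post: every input is a single basis vector e_i."""
    idx = rng.integers(0, n, size=batch)
    return np.eye(n)[idx]

def sample_independent(n, batch, p, rng):
    """Hypothetical alternative: each feature is on independently with probability p."""
    return (rng.random((batch, n)) < p).astype(float)

def cooccurrence_rate(x):
    """Fraction of inputs in which at least two features are on simultaneously."""
    return float(np.mean(x.sum(axis=1) >= 2))

rng = np.random.default_rng(0)
n, batch = 8, 10_000  # illustrative sizes

one_hot = sample_one_hot(n, batch, rng)
one_hot_rate = cooccurrence_rate(one_hot)  # no co-occurrence by construction

# Co-occurrence becomes more common as p grows.
rates = {p: cooccurrence_rate(sample_independent(n, batch, p, rng))
         for p in (0.01, 0.1, 0.3)}
```

Training on `sample_independent` batches at several values of `p` and counting polysemantic neurons afterwards would be one way to test the prediction above.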
I also don’t understand why L1 regularization on activations is necessary to show incidental polysemanticity given your setup. Even if you remove the L1 regularization on activations, it is still the case that “benign collisions” impose no cost on the model, since feature i and feature j are never simultaneously present in a given input. So if you do get a benign collision, what causes it to go away? Overall my expectation would be that without the L1 regularization on activations (and with the training dataset as described in this post), you’d get a complicated mess where every neuron is highly polysemantic, i.e. even more polysemanticity than described in this post. Why is that wrong?
Thanks for the feedback!
> In particular, I think it is likely very sensitive to the implicit assumption that feature i and feature j never co-occur on a single input.
Definitely! I still think that this assumption is fairly realistic because in practice, most pairs of unrelated features would co-occur only very rarely, and I expect the winner-take-all dynamic to dominate most of the time. But I agree that it would be nice to quantify this and test it out.
> Overall my expectation would be that without the L1 regularization on activations (and with the training dataset as described in this post), you’d get a complicated mess where every neuron is highly polysemantic, i.e. even more polysemanticity than described in this post. Why is that wrong?
If there is no L1 regularization on activations, then every hidden neuron would indeed be highly “polysemantic” in the sense that it has nonzero weights for each input feature. But on the other hand, the whole encoding space would become rotationally symmetric, and when that’s the case it feels like polysemanticity shouldn’t be about individual neurons (since the canonical basis is not special anymore) and instead about the angles that different encodings form. In particular, as long as m ≥ n, the space of optimal solutions for this setup requires the encodings W_i to form angles of at least 90° with each other, and it’s unclear whether we should call this polysemantic.
So one of the reasons why we need L1 regularization is to break the rotational symmetry and create a privileged basis: that way, it’s actually meaningful to ask whether a particular hidden neuron is representing more than one feature.
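The symmetry argument can be checked numerically. The sketch below assumes that, without the L1 term, the loss depends on the encodings only through the inner products W_i · W_j (as it would for a tied-weight model of the form ReLU(W Wᵀ x)); that model form, the sizes, and the variable names are assumptions for illustration, not details stated in this thread. The sum of absolute entries of W stands in for the L1 penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 6  # n features, m hidden neurons (illustrative sizes with m >= n)

# Row i is the encoding W_i of feature i in the hidden space.
W = rng.normal(size=(n, m))

# Random orthogonal matrix Q, obtained from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
W_rot = W @ Q  # rotate (or reflect) every encoding in the hidden space

# All inner products W_i . W_j, and hence all angles between encodings, are preserved.
gram = W @ W.T
gram_rot = W_rot @ W_rot.T
gram_unchanged = bool(np.allclose(gram, gram_rot))

# The L1 norm is not rotation-invariant, so an L1 penalty singles out a basis.
l1_before = float(np.abs(W).sum())
l1_after = float(np.abs(W_rot).sum())
```

Any loss built from `gram` alone cannot tell `W` from `W_rot`, whereas the L1 term generally can, which is the sense in which it creates a privileged basis.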
Good point on the rotational symmetry, that makes sense now.
> I still think that this assumption is fairly realistic because in practice, most pairs of unrelated features would co-occur only very rarely, and I expect the winner-take-all dynamic to dominate most of the time. But I agree that it would be nice to quantify this and test it out.
Agreed that’s a plausible hypothesis. I mostly wish that in this toy model you had a hyperparameter for the frequency of co-occurrence of features, and identified how it affects the rate of incidental polysemanticity.