My hunch about the ultra-rare features is that they’re trying to become fully dead features, but haven’t gotten there yet. Some reasons to believe this:
Anthropic mentions that “if we increase the number of training steps then networks will kill off more of these ultralow density neurons.”
The “dying” process gets slower as the feature gets closer to fully dead, since the weights only get updated when the feature fires. It may take a huge number of steps to cross the last mile between “very rare” and “dead,” and unless we’ve trained that much, we will find features that really ought to be dead in an ultra-rare state instead.
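As a toy illustration of how sharply this slows down (purely illustrative, not the actual training dynamics: I'm assuming pre-activations roughly N(b, 1) and a fixed bias decrement per firing event), the expected number of tokens needed to push the density down another order of magnitude grows like the reciprocal of the density:

```python
# Toy model of a "dying" feature: its bias b only moves on steps where the
# feature fires, so the expected wait per bias decrement scales like 1/density.
# The N(b, 1) pre-activation model and the fixed decrement are assumptions made
# purely for illustration, not anything taken from the real training run.
import numpy as np
from scipy.stats import norm

step_size = 0.01                            # bias decrement applied per firing event
b_grid = np.arange(-1.0, -6.0, -step_size)  # path the bias takes as the feature dies
density = norm.sf(-b_grid)                  # firing probability: P(N(b, 1) > 0)
expected_steps = np.cumsum(1.0 / density)   # expected tokens to reach each point on the path

for target in [-2, -3, -4, -5]:
    idx = int(np.argmax(np.log10(density) <= target))
    print(f"log10(density) = {target}: ~{expected_steps[idx]:.1e} expected training tokens")
```

In this toy picture, each extra order of magnitude of rarity costs roughly ten times as many tokens as the previous one, so the tail of the dying process is extremely long.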
Anthropic includes a 3D plot of log density, bias, and the dot product of each feature’s enc and dec vectors (“D/E projection”).
In the run that’s plotted, the ultra-rare cluster is distinguished by a combination of low density, large negative biases, and a broad distribution of D/E projection that’s ~symmetric around 0. For high-density features, the D/E projections are tightly concentrated near 1.
Large negative bias makes sense for features that are trying to never activate.
D/E projection near 1 seems intuitive for a feature that’s actually autoencoding a signal. Thus, values far from 1 might indicate that a feature is not doing any useful autoencoding work[1][2].
I plotted these quantities for the checkpoint loaded in your Colab. Oddly, the ultra-rare cluster did not have large(r) negative biases—though the distribution was different. But the D/E projection distributions looked very similar to Anthropic’s.
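For concreteness, something like the following computes all three quantities. The shapes and attribute names (W_enc of shape [d_in, d_hidden], W_dec of shape [d_hidden, d_in], b_enc) and the exact D/E projection convention are assumptions about the autoencoder's interface, not necessarily what the Colab uses:

```python
# Sketch for computing the three quantities in Anthropic's 3D plot. Shapes and
# attribute names (W_enc: [d_in, d_hidden], W_dec: [d_hidden, d_in], b_enc) are
# assumptions about the autoencoder's interface; 'acts' is a batch of MLP
# activations of shape [n_tokens, d_in].
import torch

@torch.no_grad()
def feature_stats(W_enc, W_dec, b_enc, acts, eps=1e-10):
    feature_acts = torch.relu(acts @ W_enc + b_enc)       # [n_tokens, d_hidden]
    density = (feature_acts > 0).float().mean(dim=0)      # how often each feature fires
    log_density = torch.log10(density + eps)              # eps keeps dead features finite

    # "D/E projection": decoder vector dotted with the (unit-normalized) encoder
    # vector for the same feature. Whether/how to normalize is a convention choice.
    enc_dirs = W_enc / W_enc.norm(dim=0, keepdim=True)    # [d_in, d_hidden]
    de_projection = (W_dec * enc_dirs.T).sum(dim=-1)      # [d_hidden]

    return log_density, b_enc, de_projection
```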
If we’re trying to make a feature fire as rarely as possible, and have as little effect as possible when it does fire, then the optimal value for the encoder weight is something like $\arg\min_w \mathbb{E}_{x\sim\mathcal{D}}\,\mathrm{ReLU}(w\cdot x+b)$. In other words, we’re trying to find a hyperplane where the data is all on one side, or as close to that as possible (there’s a toy sketch of this objective after the list below). If the b-dependence is not very strong (which could be the case in practice), then:
there’s some optimal encoder weight w that all the dying neurons will converge towards
the nonlinearity will make it hard to find this value with purely linear algebraic tools, which explains why it doesn’t pop out of an SVD or the like
the value is chosen to suppress firing as much as possible in aggregate, not to make firing happen on any particular subset of the data, which explains why the firing pattern is not interpretable
there could easily be more than one orthogonal hyperplane such that almost all the data is on one side, which explains why the weights all converge to some new direction when the original one is prohibited
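Here’s the toy sketch mentioned above: minimizing $\mathbb{E}_x[\mathrm{ReLU}(w\cdot x+b)]$ by gradient descent on made-up data with one “one-sided” coordinate. The data distribution, the fixed bias, and the unit-norm constraint on w are all assumptions for illustration, not claims about the real activation distribution:

```python
# Toy version of the "push the data onto one side of a hyperplane" objective
# argmin_w E_x[ReLU(w·x + b)]. The data, the fixed bias, and the unit-norm
# constraint on w are all made up for illustration.
import torch

torch.manual_seed(0)
d = 64
data = torch.randn(10_000, d)
data[:, 0] = data[:, 0].abs()       # one coordinate is strictly one-sided

b = -0.5
w = torch.randn(d, requires_grad=True)
opt = torch.optim.Adam([w], lr=1e-2)

for _ in range(2_000):
    loss = torch.relu(data @ w + b).mean()   # E[ReLU(w·x + b)] over the toy data
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        w /= w.norm()                        # keep ||w|| = 1 so it can't just shrink to zero

print("fraction of points still firing:", ((data @ w + b) > 0).float().mean().item())
print("weight on the one-sided coordinate:", w[0].item())
```

The objective cares about which side of the hyperplane the data mass sits on, an asymmetry that second-moment methods like SVD aren’t designed to pick up, which fits with the direction not dropping out of an SVD.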
To test this hypothesis, I guess we could watch how density evolves for rare features over training, up until the point where they are re-initialized? Maybe choose a random subset of them to not re-initialize, and then watch them?
I’d expect these features to get steadily rarer over time, and to never reach some “equilibrium rarity” at which they stop getting rarer. (On this hypothesis, the actual log-density we observe for an ultra-rare feature is an artifact of the training step—it’s not useful for autoencoding that this feature activates on exactly one in 1e6 tokens or whatever, it’s simply that we have not waited long enough for the density to become 1e-7, then 1e-8, etc.)
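A minimal sketch of the measurement side of that experiment, assuming we have a sequence of saved (W_enc, b_enc) checkpoints and a fixed evaluation batch of MLP activations (both of these are hypothetical names, not anything from the actual training code):

```python
# Sketch of tracking per-feature log-density across training checkpoints.
# The (W_enc, b_enc) checkpoint format and the fixed evaluation batch 'eval_acts'
# ([n_tokens, d_in]) are assumptions; plug in whatever the training code saves.
import torch

@torch.no_grad()
def density_trajectories(checkpoints, eval_acts, eps=1e-10):
    rows = []
    for W_enc, b_enc in checkpoints:
        feature_acts = torch.relu(eval_acts @ W_enc + b_enc)
        density = (feature_acts > 0).float().mean(dim=0)
        rows.append(torch.log10(density + eps))
    return torch.stack(rows)   # [n_checkpoints, d_hidden]

# On this hypothesis, the trajectories of ultra-rare features (especially any
# held out from re-initialization) keep drifting downward over training rather
# than flattening out at some equilibrium log-density.
```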
Intuitively, when such a “useless” feature fires in training, the W_enc gradient is dominated by the L1 term and tries to get the feature to stop firing, while the W_dec gradient is trying to stop the feature from interfering with the useful ones if it does fire. There’s no obvious reason these should have similar directions.
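To make that concrete, for a standard SAE loss of the form reconstruction MSE plus an L1 penalty on the feature activations (the exact loss and constants in the Colab may differ, so treat this as a sketch):

$$\mathcal{L} = \lVert \hat{x} - x\rVert^2 + \lambda \sum_j f_j,\qquad f_j = \mathrm{ReLU}(w_j\cdot x + b_j),\qquad \hat{x} = \textstyle\sum_j f_j\, d_j + b_{\mathrm{dec}},$$

$$\frac{\partial \mathcal{L}}{\partial w_i} = \mathbb{1}[f_i>0]\,\bigl(2\,d_i\cdot(\hat{x}-x) + \lambda\bigr)\,x, \qquad \frac{\partial \mathcal{L}}{\partial d_i} = 2\,f_i\,(\hat{x}-x),$$

where $w_i$ is feature $i$’s row of W_enc and $d_i$ its row of W_dec. For a feature that fires rarely and weakly, the residual term in the encoder gradient is small, so the $\lambda\,x$ term dominates and pushes $w_i$ (and $b_i$) to suppress firing on inputs like the current one, while the decoder gradient points along the reconstruction residual; nothing ties these two directions together, which is consistent with D/E projections scattered around 0.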
Although it’s conceivable that the ultra-rare features are “conspiring” to do useful work collectively, in a very different way from how the high-density features do useful work.
Thanks for the analysis! This seems pretty persuasive to me, especially the argument that “fire as rarely as possible” could incentivise learning the same feature, and that it doesn’t trivially fall out of other dimensionality reduction methods. I think this predicts that if we look at the gradient with respect to the pre-activation value, in MLP activation space, its average will correspond to the rare feature direction? Though maybe not, since we’d want the average weighted by “how often does this flip a feature from on to off”; there’s no incentive to push a pre-activation from −4 to −5.
An update is that when training on gelu-2l with the same parameters, I get truly dead features but fairly few ultra-low density features, and in one autoencoder (I think the final layer) the truly dead features are gone. This time I trained on mlp_output rather than mlp_activations, which is another possible difference.