Thanks for the analysis! This seems pretty persuasive to me, especially the argument that “fire as rarely as possible” could incentivise learning the same feature, and that it doesn’t trivially fall out of other dimensionality reduction methods. I think this predicts that if we look at the gradient with respect to the pre-activation value in MLP activation space, its average will correspond to the rare feature direction? Though maybe not, since we actually want the average weighted by “how often does this cause a feature to flip from on to off”; there’s no incentive to push a pre-activation from −4 down to −5.
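To make the “no incentive to go from −4 to −5” point concrete, here is a toy sketch (the setup and names are my own, not the original training code): for a single SAE feature with unit encoder direction `w`, the pre-activation is `w·x`, and an L1 penalty on `relu(w·x)` has gradient `w * 1[w·x > 0]` with respect to `x`. The batch-averaged gradient therefore points along `w`, scaled by the feature’s firing frequency, and is exactly zero on inputs where the feature is already off.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w = rng.normal(size=d)
w /= np.linalg.norm(w)                    # hypothetical feature direction
xs = rng.normal(size=(100_000, d))        # fake stand-in for MLP activations

fires = xs @ w > 0                        # inputs on which the feature fires
# Per-input gradient of |relu(w·x)| w.r.t. x: w where it fires, 0 elsewhere.
grads = np.where(fires[:, None], w, 0.0)
avg_grad = grads.mean(axis=0)

# The average gradient is (firing frequency) * w: zero contribution from
# inputs where the feature is off, so -4 vs -5 makes no difference.
assert np.allclose(avg_grad, fires.mean() * w)
print(fires.mean())                       # firing frequency, ~0.5 for this toy setup
```

This is only meant to illustrate the gating structure; in the real autoencoder the average would be over the training distribution and entangled with the reconstruction term.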
An update: when training on gelu-2l with the same parameters, I get truly dead features but fairly few ultra-low-frequency features, and in one autoencoder (I think the final layer) the truly dead features are gone. This time I trained on mlp_output rather than mlp_activations, which is another possible source of the difference.