The way I would phrase this concern is “SAEs might learn to pick up on structure present in the underlying data, rather than to pick up on learned structure in NN activations.” E.g. since “tree” is a class of things defined by a bunch of correlations present in the underlying image data, it’s possible that images of trees will naturally cluster in NN activations even when the NN has no underlying tree concept; SAEs would still be able to detect and learn this cluster as one of their neurons.
I agree this is a valid critique. Here’s one empirical test which partially gets at it: what happens when you train an SAE on a NN with random weights? (I.e. you randomize the parameters of your NN, and then train an SAE on its activations on real data in the normal way.) Then to the extent that your SAE has good-looking features, that must be because your SAE was picking up on structure in the underlying data.
My collaborators and I did this experiment. In more detail, we trained SAEs on Pythia-70m’s MLPs, then did this again but after randomizing the weights of Pythia-70m. Take a moment to predict the results if you want etc etc.
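To make the setup concrete, here's a minimal sketch of this kind of control in PyTorch with HuggingFace transformers. This is not our actual training code: the layer index, re-initialization scale, and SAE hyperparameters below are illustrative placeholders.

```python
# Sketch of the randomized-weights control (illustrative, not our exact setup).
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

# Control condition: re-initialize every parameter.
# (For the "keep embeddings" variant below, skip params with "embed" in their name.)
with torch.no_grad():
    for name, p in model.named_parameters():
        p.normal_(mean=0.0, std=0.02)

# Cache activations from one MLP (layer 3 chosen arbitrarily) on real data.
acts = []
model.gpt_neox.layers[3].mlp.register_forward_hook(
    lambda mod, inp, out: acts.append(out.detach().reshape(-1, out.shape[-1]))
)
batch = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    model(**batch)  # populates `acts` via the hook

# A generic ReLU autoencoder trained with an L1 sparsity penalty on the cached activations.
class SAE(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(f), f

x = torch.cat(acts)
sae = SAE(d_in=x.shape[-1], d_hidden=8 * x.shape[-1])
x_hat, f = sae(x)
loss = (x - x_hat).pow(2).mean() + 1e-3 * f.abs().sum(dim=-1).mean()
```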
The SAEs that we trained on a random network looked bad. The most interesting dictionary features we found were features that activated on particular tokens (e.g. features that activated on the “man” token and no others). Most of the features didn’t look like anything at all, activating on a large fraction (>10%) of tokens in our data with no obvious patterns. (The features for dictionaries trained on the non-random network looked much better.)
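For concreteness, here's one way to compute that firing-rate statistic; `feature_acts` is assumed to be a (num_tokens, num_features) tensor of SAE feature activations collected over a dataset (the random tensor here is just a stand-in).

```python
# How often does each SAE feature fire? Dense, pattern-free firing (>10% of
# tokens) is the "doesn't look like anything" failure mode described above.
import torch

def firing_fraction(feature_acts: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
    """Fraction of tokens on which each feature is active (above `threshold`)."""
    return (feature_acts > threshold).float().mean(dim=0)

feature_acts = torch.relu(torch.randn(10_000, 4096) - 2)  # stand-in, ~2% firing rate
fracs = firing_fraction(feature_acts)
dense_features = (fracs > 0.10).nonzero().squeeze(-1)  # candidates for "not interpretable"
```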
We also did a variant of this experiment where we randomized Pythia-70m’s parameters except for the embedding layer. In this variant, the most interesting features we found were features which fired on a few semantically related tokens (e.g. the tokens “make,” “makes,” and “making”).
Thanks to my collaborators for this experiment: Aaron Mueller and David Bau.
I agree that a reasonable intuition for what SAEs do is: identify “basic clusters” in NN activations (basic in the sense that you allow compositionality, i.e. you don’t try to learn clusters whose centroids are the sums of the centroids of previously-learned clusters). And these clusters might exist because:
your NN has learned concepts and these clusters correspond to concepts (what we hope is the reason), or
because of correlations present in your underlying data (the thing that you seem to be worried about).
Beyond the preliminary empirics I mentioned above, I think there are some theoretical reasons to hope that SAEs will mostly learn the first type of cluster:
Most clusters in NN activations on real data might be of the first type
This is because the NN has already, during training, noticed various correlations in the data and formed concepts around them, to the extent that these concepts were useful for getting low loss. They typically will be useful if your model is trained on next-token prediction, a task which incentivizes you to model all the correlations in the data.
Clusters of the second type might not have any interesting compositional structure, but your SAE gets bonus points for learning clusters which participate in compositional structure.
E.g. if there are five clusters with centroids w, x, y, z, and y + z, and your SAE can only learn two of them, then it would prefer to learn the clusters with centroids y and z (because then it can model the cluster with centroid y + z for free); the toy sketch below makes this concrete.
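Here's a toy numeric version of that argument, with made-up 16-dimensional centroids. For simplicity it scores each 2-atom dictionary by ordinary least-squares reconstruction error rather than a sparse code, which is enough to show the asymmetry.

```python
# Toy illustration: a dictionary containing y and z also covers the y + z
# cluster "for free", while a dictionary containing w and x does not.
import numpy as np

rng = np.random.default_rng(0)
w, x, y, z = rng.normal(size=(4, 16))          # made-up cluster centroids
clusters = {"w": w, "x": x, "y": y, "z": z, "y+z": y + z}

def reconstruction_error(atoms: np.ndarray, target: np.ndarray) -> float:
    """Least-squares error reconstructing `target` from the span of `atoms`."""
    coeffs, *_ = np.linalg.lstsq(atoms.T, target, rcond=None)
    return float(np.linalg.norm(atoms.T @ coeffs - target))

dict_yz, dict_wx = np.stack([y, z]), np.stack([w, x])
for name, c in clusters.items():
    print(f"{name:>4}: yz-dict {reconstruction_error(dict_yz, c):.3f}, "
          f"wx-dict {reconstruction_error(dict_wx, c):.3f}")
# The {y, z} dictionary reconstructs y, z, and y+z with ~zero error (3 of the
# 5 clusters); the {w, x} dictionary only covers w and x.
```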
In Towards Monosemanticity we also did a version of this experiment, and found that the SAE was much less interpretable when the transformer weights were randomized (https://transformer-circuits.pub/2023/monosemantic-features/index.html#appendix-automated-randomized).
(The auto-interp correlation results are less clear: they find similar correlation coefficients with and without weight randomization. However, they note that this might be due to single-token features on the part of the randomized transformer, and when you ignore these features (or correct for them in some other way I’m forgetting?), the SAE trained on the actual transformer indeed has higher correlation.)
Another metric: comparing the similarity between two dictionaries using mean max cosine similarity (where one of the dictionaries is treated as the ground truth), we’ve found that two dictionaries trained from different random seeds on the same (non-randomized) model are highly similar (>0.95), whereas dictionaries trained on a randomized model and a non-randomized model are dissimilar (<0.3 IIRC, but I don’t have the data on hand).
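For reference, here's a sketch of the mean max cosine similarity comparison, assuming each dictionary is given as a (num_features, d_model) matrix of feature directions (e.g. decoder rows) and one dictionary is treated as the ground truth.

```python
# Mean max cosine similarity (MMCS) between two dictionaries of feature directions.
import torch
import torch.nn.functional as F

def mean_max_cosine_similarity(learned: torch.Tensor, ground_truth: torch.Tensor) -> float:
    """For each ground-truth feature, take its best cosine match among the
    learned features, then average those best-match similarities."""
    learned = F.normalize(learned, dim=-1)
    ground_truth = F.normalize(ground_truth, dim=-1)
    sims = ground_truth @ learned.T            # (n_ground_truth, n_learned)
    return sims.max(dim=-1).values.mean().item()

# Two seeds on the same (non-randomized) model should score high; a dictionary
# from a randomized model compared against one from the real model scores low.
```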