[Proposal] Do SAEs learn universal features? Measuring Equivalence between SAE checkpoints
If we train several SAEs from scratch on the same set of model activations, are they “equivalent”?
Here are three notions of "equivalence":
Direct equivalence. Features in one SAE are the same (in terms of decoder weight) as features in another SAE.
Linear equivalence. Features in one SAE directly correspond one-to-one with features in another SAE after some global transformation like rotation.
Functional equivalence. The SAEs define the same input-output mapping.
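To make these notions concrete, here is a rough sketch of how each one might be tested. The function names, the use of orthogonal Procrustes for the "global transformation", and the scoring choices are my own assumptions, purely illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.linalg import orthogonal_procrustes

# Each SAE is summarised by its decoder matrix of shape (n_features, d_model)
# and, for the functional notion, by a reconstruction function acts -> acts_hat.

def direct_equivalence(dec_a, dec_b):
    """Mean cosine similarity under the best one-to-one matching of decoder directions."""
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sims = a @ b.T
    rows, cols = linear_sum_assignment(-sims)            # maximise total similarity
    return sims[rows, cols].mean()                       # near 1.0 => same features

def linear_equivalence(dec_a, dec_b, n_iters=10):
    """Direct matching after the best global rotation of A's decoder directions.
    Alternates matching and Procrustes (an ICP-style heuristic, not a guaranteed optimum)."""
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    R = np.eye(a.shape[1])
    for _ in range(n_iters):
        sims = (a @ R) @ b.T
        rows, cols = linear_sum_assignment(-sims)        # match features under current rotation
        R, _ = orthogonal_procrustes(a[rows], b[cols])   # best rotation for this matching
    return ((a @ R) @ b.T)[rows, cols].mean()

def functional_equivalence(reconstruct_a, reconstruct_b, acts):
    """Relative difference between the two SAEs' outputs on the same activations."""
    diff = reconstruct_a(acts) - reconstruct_b(acts)
    return np.linalg.norm(diff) / np.linalg.norm(acts)   # near 0.0 => same input-output map
```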
A priori, I would expect that we get rough functional equivalence, but not direct feature equivalence. I think this experiment would help elucidate the underlying invariant geometric structure that SAE features are suspected to lie in.
Changelog:
18/07/2024 - Added discussion on "linear equivalence"
Found this graph on the old sparse_coding channel on the EleutherAI Discord:
Logan Riggs: For MCS across dicts of different sizes (as a baseline that's better, but not as good as dicts of same size/diff init). Notably, layer 5 sucks. Also, layer 2 was trained differently than the others, but I don't have the hyperparams or amount of training data on hand.
So at least tentatively that looks like “most features in a small SAE correspond one-to-one with features in a larger SAE trained on the activations of the same model on the same data”.
Oh that's really interesting! Can you clarify what "MCS" means? And can you elaborate a bit on how I'm supposed to interpret these graphs?

Yeah, "MCS" stands for Max Cosine Similarity. Cosine similarity is a pretty standard measure of how close two vectors are to pointing in the same direction. It's the cosine of the angle between the two vectors, so +1.0 means the vectors point in exactly the same direction, 0.0 means they're orthogonal, and −1.0 means they point in exactly opposite directions.
To generate this graph, I think he took each of the learned features in the smaller dictionary, calculated the cosine similarity of that small-dictionary feature with every feature in the larger dictionary, and then took the maximal cosine similarity as the MCS for that small-dictionary feature. I have a vague memory of him also doing some fancy linear_sum_assignment() thing (to ensure that each feature in the large dictionary could only be used once, so that multiple features in the small dictionary couldn't get their MCS from the same feature in the large dictionary), though IIRC it didn't actually matter.
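A minimal sketch of that computation as I understand it (the function names and array shapes are my own, not taken from the original notebook):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_cosine_similarity(small_dec, large_dec):
    """MCS for each small-dictionary feature: its highest cosine similarity
    with any feature in the larger dictionary.

    small_dec: (n_small, d_model) decoder directions of the smaller SAE
    large_dec: (n_large, d_model) decoder directions of the larger SAE
    """
    small = small_dec / np.linalg.norm(small_dec, axis=1, keepdims=True)
    large = large_dec / np.linalg.norm(large_dec, axis=1, keepdims=True)
    sims = small @ large.T              # (n_small, n_large) cosine similarities
    return sims.max(axis=1)             # one MCS value per small-dictionary feature

def matched_cosine_similarity(small_dec, large_dec):
    """Variant where each large-dictionary feature can be matched at most once,
    using an optimal one-to-one assignment (what linear_sum_assignment computes)."""
    small = small_dec / np.linalg.norm(small_dec, axis=1, keepdims=True)
    large = large_dec / np.linalg.norm(large_dec, axis=1, keepdims=True)
    sims = small @ large.T
    rows, cols = linear_sum_assignment(-sims)   # maximise total matched similarity
    return sims[rows, cols]
```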
Also, I think the small and large dictionaries were trained using different methods from each other for layer 2, and this was on pythia-70m-deduped, so layer 5 was the final layer immediately before unembedding (so naively I'd expect most of the "features" there to just be "the output token will be 'the'" or "the output token will be 'when'", etc.).
Edit: In terms of "how to interpret these graphs": they're histograms, with the horizontal axis being bins of cosine similarity and the vertical axis being how many small-dictionary features had their maximum cosine similarity with a large-dictionary feature fall within that bin. So you can see that at layer 3, somewhere around half of the small-dictionary features had a cosine similarity of 0.96-1.0 with one of the large-dictionary features, and almost all of them had a cosine similarity of at least 0.8 with their best-matching large-dictionary feature.
Which I read as “large dictionaries find basically the same features as small ones, plus some new ones”.
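For concreteness, here is roughly how such a histogram could be produced from the per-feature MCS values; the decoder matrices below are random placeholders just so the snippet runs, and real ones would come from the trained SAEs:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
small_dec = rng.standard_normal((2048, 512))    # placeholder small-dictionary decoder directions
large_dec = rng.standard_normal((4096, 512))    # placeholder large-dictionary decoder directions

# MCS of each small-dictionary feature with its best match in the large dictionary.
small = small_dec / np.linalg.norm(small_dec, axis=1, keepdims=True)
large = large_dec / np.linalg.norm(large_dec, axis=1, keepdims=True)
mcs = (small @ large.T).max(axis=1)

plt.hist(mcs, bins=np.arange(0.0, 1.04, 0.04))  # bins of cosine similarity
plt.xlabel("MCS with best large-dictionary feature")
plt.ylabel("number of small-dictionary features")
plt.show()
```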
Bear in mind also that these were some fairly small dictionaries. I think these charts were generated with this notebook, so I think smaller_dict was of size 2048 and larger_dict was of size 4096 (with a residual width of 512, so 4x and 8x respectively). Anthropic went all the way to 256x the residual width in their "Towards Monosemanticity" paper later that year, and the behavior might have changed at that scale.
If we train several SAEs from scratch on the same set of model activations, are they “equivalent”?
For SAEs of different sizes, at most layers the smaller SAE's features do have very high similarity with some of the larger SAE's features, but it's not always true. I'm working on an upcoming post on this.
Interesting, we find that all features in a smaller SAE have a feature in a larger SAE with cosine similarity > 0.7, but not all features in a larger SAE have a close relative in a smaller SAE (though roughly 65% do have a close equivalent at a 2x scale-up).