If we train several SAEs from scratch on the same set of model activations, are they “equivalent”?
For SAEs of different sizes, in most layers the smaller SAE's features have very high similarity to some of the larger SAE's features, but this doesn't always hold. I'm working on an upcoming post on this.
Interesting. We find that every feature in a smaller SAE has a counterpart in a larger SAE with cosine similarity > 0.7, but not every feature in a larger SAE has a close relative in a smaller SAE (though about 65% do have a close equivalent at a 2x scale-up).
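For anyone who wants to run this kind of comparison themselves, here is a rough sketch of the matching step. It assumes each SAE feature is represented by a direction vector (for example a decoder row), which the comments above don't actually specify. The function names and the toy vectors are made up for illustration, and the 0.7 threshold is just the one quoted above. For real SAEs you would vectorize this with a matrix library rather than use plain Python loops.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_match_similarities(features_a, features_b):
    """For each feature in A, the highest cosine similarity to any feature in B."""
    return [max(cosine(u, v) for v in features_b) for u in features_a]

def fraction_matched(features_a, features_b, threshold=0.7):
    """Fraction of A's features whose best match in B exceeds the threshold."""
    sims = best_match_similarities(features_a, features_b)
    return sum(s > threshold for s in sims) / len(sims)

# Hypothetical toy feature directions, not real SAE weights.
small_sae = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
large_sae = [
    [0.9, 0.1, 0.0],  # close to the first small-SAE feature
    [0.0, 1.0, 0.1],  # close to the second small-SAE feature
    [0.0, 0.0, 1.0],  # no close relative in the small SAE
    [0.5, 0.5, 0.7],  # no close relative in the small SAE
]

small_to_large = fraction_matched(small_sae, large_sae)
large_to_small = fraction_matched(large_sae, small_sae)
print(small_to_large, large_to_small)
```

The two directions are computed separately because the measure is asymmetric: every small-SAE feature can have a match in the larger SAE while only some of the larger SAE's features have one in the smaller SAE.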