Thanks!
One cheap and lazy approach is to see how many of your features have high cosine similarity with the features of an existing L1-trained SAE (e.g. “900 of the 2048 features detected by the -trained model had cosine sim > 0.9 with one of the 2048 features detected by the L1-trained model”).
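For concreteness, here's roughly what I have in mind (a minimal sketch, assuming each SAE's decoder directions are available as an `(n_features, d_model)` tensor; the names here are placeholders rather than anything from a particular codebase):

```python
import torch
import torch.nn.functional as F

def frac_matched(decoder_a: torch.Tensor, decoder_b: torch.Tensor, threshold: float = 0.9) -> float:
    """Fraction of features in decoder_a whose best cosine match among
    decoder_b's features exceeds `threshold`.

    Both inputs are (n_features, d_model) matrices of decoder directions.
    """
    a = F.normalize(decoder_a, dim=-1)
    b = F.normalize(decoder_b, dim=-1)
    sims = a @ b.T                       # (n_a, n_b) pairwise cosine similarities
    best = sims.max(dim=-1).values       # best match in B for each feature in A
    return (best > threshold).float().mean().item()

# Usage (W_dec is a placeholder for however the decoder weights are stored):
# frac_matched(new_sae.W_dec, l1_sae.W_dec, threshold=0.9)
```

For very wide SAEs the full similarity matrix can get large, so you might want to chunk the matmul, but the idea is just "best cosine match per feature, then count how many clear the threshold".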
I looked at the cosine sims between the L1-trained reference model and one of my SAEs presented above and found:
2501 out of 24576 (10%) of the features detected by the -trained model had cosine sim > 0.9 with one of the 24576 features detected by the L1-trained model.
7774 out of 24576 (32%) had cosine sim > 0.8
50% had cosine sim > 0.686
I’m not sure how to interpret these. Are they low or high? They appear to be roughly similar to what I get if I compare two of the -trained SAEs with each other.
I’d also be interested to see closer examinations of some of the individual features that consistently appear across multiple training runs of the -trained model but don’t appear in an L1-trained SAE on the same training dataset.
I think I’ll look more at this. Some summarised examples are shown in the response above.
Interesting, thanks for sharing! Are there specific existing ideas you think would be valuable for people to look at in the context of SAEs & language models, but that they are perhaps unaware of?