Lee Sharkey comments on [Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey 23 Feb 2023 21:04 UTC
Thanks for your interest!

The autoencoder losses reported are training losses, and you’re right that noise could be an issue. I strongly suspect that some of the problems in these results come from having too few data points to train the autoencoders on LM data.

> I would also be interested to test a bit more if this method works on toy models which clearly don’t have many features, such as a mixture of a dozen of gaussians, or random points in the unit square (where there is a lot of room “in the corners”), to see if this method produces strong false positives.

I’d be curious to see these results too!
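
For anyone who wants to try it, here is a minimal sketch of how the two toy data sets from the quote might be generated. The function names, the dimensionality, the component spread `std`, and the seeds are all assumptions chosen for illustration; none of them come from the report.

```python
import random


def gaussian_mixture(n_points, n_components=12, dim=2, std=0.05, seed=0):
    """Sample points from a mixture of isotropic Gaussians.

    Assumed setup: component centres are drawn uniformly from the unit
    hypercube and every component has the same standard deviation.
    """
    rng = random.Random(seed)
    centers = [
        [rng.uniform(0.0, 1.0) for _ in range(dim)]
        for _ in range(n_components)
    ]
    points, labels = [], []
    for _ in range(n_points):
        k = rng.randrange(n_components)  # pick a component uniformly
        points.append([rng.gauss(c, std) for c in centers[k]])
        labels.append(k)
    return points, labels, centers


def uniform_unit_cube(n_points, dim=2, seed=0):
    """Sample points uniformly from [0, 1)^dim (the unit square for dim=2)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n_points)]


mixture_points, mixture_labels, mixture_centers = gaussian_mixture(1000)
square_points = uniform_unit_cube(1000)
```

The autoencoders could then be trained on `mixture_points` and `square_points` in place of LM activations. If the method reports far more features than the dozen mixture components, or many features for the uniform data, that would point to false positives.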

> Layer 0 is also a baseline, since I expect embeddings to have fewer features than activations in later layers, though I’m not sure how many features you should expect in layer 0.

A rough estimate would be somewhere on the order of the vocabulary size (here about 50k). A reason to think it might be more is that layer 0 MLP activations follow an attention layer, so features may represent combinations of token embeddings at different sequence positions, and there are far more possible combinations of tokens than there are tokens in the vocabulary. A reason to think it might be fewer is that a lot of directions may get ‘compressed away’ in small networks.
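
To give a sense of scale for the "more" case, here is a small back-of-the-envelope calculation. It counts ordered token combinations across a few sequence positions, which is only a loose upper bound on what could be represented and not an estimate of the actual feature count. The helper name and the choice of two positions are assumptions for illustration.

```python
def count_token_combinations(vocab_size, n_positions):
    """Number of ordered token tuples across n_positions sequence positions."""
    return vocab_size ** n_positions


vocab_size = 50_000  # approximate vocabulary size mentioned above

single_tokens = count_token_combinations(vocab_size, 1)
token_pairs = count_token_combinations(vocab_size, 2)

print(f"single tokens:       {single_tokens:,}")
print(f"ordered token pairs: {token_pairs:,}")
```

Even with only two positions, the number of possible combinations exceeds the vocabulary size by a factor equal to the vocabulary size itself, so a small network could only ever represent a tiny fraction of them.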