I’m interested to know where this research will lead you!
A small detail: for the experiments on LMs, did you measure the train or the test loss? I expect this to matter, since activations are likely noisy, and overfitting that noise can use up many sparse features (unless the number of data points is extremely large relative to the number of parameters).
I would also be interested in testing a bit more whether this method works on toy models that clearly don’t have many features, such as a mixture of a dozen Gaussians, or random points in the unit square (where there is a lot of room “in the corners”), to see whether this method produces strong false positives. Layer 0 is also a baseline, since I expect embeddings to have fewer features than activations in later layers, though I’m not sure how many features you should expect in layer 0. I hope you’ll find what’s wrong with layer 0 in your experiments!
Thanks for your interest!
The autoencoder losses reported are the train losses. And you’re right to point out that noise is potentially an issue: my strong suspicion is that some of the problems in these results are due to there being too few data points to train the autoencoders on LM data.
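One way to quantify this would be to also track a held-out loss alongside the train loss and watch the gap. A minimal sketch of what I mean (the architecture, hyperparameters, and the random stand-in for LM activations below are purely illustrative, not the exact setup used here):

```python
# Hedged sketch: compare train vs held-out loss for a sparse autoencoder.
# Everything here (dims, sparsity penalty, data) is a placeholder.
import torch
import torch.nn as nn

d_act, d_dict, l1_coef = 128, 1024, 1e-3  # activation dim, dictionary size, sparsity weight

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act, d_dict):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_dict)
        self.decoder = nn.Linear(d_dict, d_act)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))
        return self.decoder(codes), codes

def losses(model, x):
    recon, codes = model(x)
    mse = ((recon - x) ** 2).mean()
    return mse + l1_coef * codes.abs().mean(), mse

# Stand-in for activations collected from the LM; replace with real activations.
acts = torch.randn(50_000, d_act)
train_x, heldout_x = acts[:40_000], acts[40_000:]

model = SparseAutoencoder(d_act, d_dict)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(500):
    batch = train_x[torch.randint(0, len(train_x), (1024,))]
    loss, _ = losses(model, batch)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    _, train_mse = losses(model, train_x[:10_000])
    _, heldout_mse = losses(model, heldout_x)
print(f"train MSE {train_mse.item():.4f} vs held-out MSE {heldout_mse.item():.4f}")
```

If the held-out reconstruction loss is much worse than the train loss (and the gap grows with dictionary size), that would be a sign that some of the learned sparse features are fitting noise rather than real structure.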
> I would also be interested in testing a bit more whether this method works on toy models that clearly don’t have many features, such as a mixture of a dozen Gaussians, or random points in the unit square (where there is a lot of room “in the corners”), to see whether this method produces strong false positives.
I’d be curious to see these results too!
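For concreteness, here is a rough sketch of the kind of toy data I understand you to be suggesting (the dimensions and noise scale are made up); the idea would be to train the same sort of sparse autoencoder on these and see how many features it claims to find when the ground truth is only a dozen clusters or two coordinates:

```python
# Hedged sketch: toy datasets with very few true features, to probe for false positives.
import numpy as np

rng = np.random.default_rng(0)
n_points, d = 50_000, 64  # sample count and ambient dimension are placeholders

# (a) Mixture of a dozen Gaussians: ~12 "true" features (the cluster identities).
centers = rng.normal(size=(12, d))
labels = rng.integers(0, 12, size=n_points)
gaussian_mixture = centers[labels] + 0.1 * rng.normal(size=(n_points, d))

# (b) Random points in the unit square, linearly embedded in d dimensions:
# only 2 underlying degrees of freedom, with plenty of room "in the corners".
basis = rng.normal(size=(2, d))
unit_square = rng.uniform(0.0, 1.0, size=(n_points, 2)) @ basis

# Train the autoencoder on `gaussian_mixture` or `unit_square` and count how many
# dictionary directions are actually used; many more than ~12 (or ~2) would suggest
# the method can produce false positives.
```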
> Layer 0 is also a baseline, since I expect embeddings to have fewer features than activations in later layers, though I’m not sure how many features you should expect in layer 0.
A rough estimate would be somewhere on the order of the vocabulary size (here 50k). A reason to think it might be more is that the layer 0 MLP activations come after an attention layer, so features may represent combinations of token embeddings at different sequence positions, and there are far more potential combinations of tokens than there are entries in the vocabulary. A reason to think it might be fewer is that a lot of directions may get ‘compressed away’ in small networks.
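To make the “more combinations than vocabulary entries” point concrete, a back-of-the-envelope count (just arithmetic, not a claim about the true feature count):

```python
# Even restricting to ordered pairs of tokens, the number of potential
# combinations dwarfs the vocabulary size.
vocab_size = 50_000
print(f"single tokens: {vocab_size:,}")             # 50,000
print(f"ordered token pairs: {vocab_size ** 2:,}")  # 2,500,000,000
```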