Rohin Shah comments on Showing SAE Latents Are Not Atomic Using Meta-SAEs

Rohin Shah Sep 22, 2024, 7:52 AM
LW: 7 AF: 5
Suppose you trained a regular SAE in the normal way with a dictionary size of 2304. Do you expect the latents to be systematically different from the ones in your meta-SAE?
For example, here’s one systematic difference. The regular SAE is optimized to reconstruct activations uniformly sampled from your token dataset. The meta-SAE is optimized to reconstruct decoder vectors, which in turn were optimized to reconstruct activations from the token dataset—however, different decoder vectors have different frequencies of firing in the token dataset, so uniform over decoder vectors != uniform over token dataset. This means that, relative to the regular SAE, the meta-SAE will tend to have less precise / granular latents for concepts that occur frequently in the token dataset, and more precise / granular latents for concepts that occur rarely in the token dataset (but are frequent enough that they are represented in the set of decoder vectors).
It’s not totally clear which of these is “better” or more “fundamental”, though if you’re trying to optimize reconstructed loss, you should expect the regular SAE to do better based on this systematic difference.
(You could of course change the training for the meta-SAE to decrease this systematic difference, e.g. by sampling from the decoder vectors in proportion to their average magnitude over the token dataset, instead of sampling uniformly.)
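A minimal sketch of that reweighting, under assumptions not in the original comment: the names `decoder`, `latent_acts`, and `frequency_weighted_sample` are hypothetical, the activations are toy data, and "average magnitude" is read as the mean absolute activation of each latent over the token dataset.

```python
import numpy as np

def frequency_weighted_sample(decoder, latent_acts, batch_size, rng):
    """Sample decoder vectors in proportion to each latent's mean |activation|.

    decoder:     (n_latents, d_model) array, one decoder vector per row.
    latent_acts: (n_tokens, n_latents) array of SAE latent activations.
    Returns the sampled vectors, their row indices, and the sampling probabilities.
    """
    # Average magnitude of each latent over the token dataset (assumed weighting).
    weights = np.abs(latent_acts).mean(axis=0)
    total = weights.sum()
    if total <= 0:
        raise ValueError("all latents have zero average activation")
    probs = weights / total
    # Sample with replacement, so frequently firing latents appear more often.
    idx = rng.choice(decoder.shape[0], size=batch_size, replace=True, p=probs)
    return decoder[idx], idx, probs

# Toy data: latent 0 fires on about half the tokens, the others on about 1%.
rng = np.random.default_rng(0)
n_tokens, n_latents, d_model = 1000, 8, 4
decoder = rng.normal(size=(n_latents, d_model))
decoder /= np.linalg.norm(decoder, axis=1, keepdims=True)
fire_prob = np.array([0.5] + [0.01] * (n_latents - 1))
fires = rng.random((n_tokens, n_latents)) < fire_prob
latent_acts = fires * rng.random((n_tokens, n_latents))

batch, idx, probs = frequency_weighted_sample(decoder, latent_acts, 256, rng)
```

Under uniform sampling every decoder vector would get probability 1/8 here. With the weighting, the frequently firing latent dominates the meta-SAE's training batches, which moves the objective closer to what the regular SAE sees.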
- Neel Nanda Sep 22, 2024, 9:47 AM
  LW: 7 AF: 5
  Interesting thought! I expect there are systematic differences, though it’s not quite obvious what they are. Your example seems pretty plausible to me. Meta-SAEs are also more incentivized to learn features that tend to split a lot, I think, since those are useful for predicting many latents. Though features that don’t split may still be useful, since they entirely explain a latent that’s otherwise hard to explain.
  
  Anyway, we haven’t checked yet, but I expect many of the results in this post would look similar for e.g. sparse linear regression over a smaller SAE’s decoder. As for why meta-SAEs are interesting at all: they’re much cheaper to train than a smaller SAE, and BatchTopK gives you more control over the L0 than you could easily get with sparse linear regression. Those are mild advantages, but you may have a small SAE lying around anyway. I see the interesting point of this post more as “SAE latents are not atomic, as shown by one method, but probably other methods would work well too.”
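
As a rough illustration of the sparse linear regression alternative mentioned in the reply above: the reply does not say which sparse regression method is meant, so this sketch swaps in orthogonal matching pursuit as one possible choice. The names `omp`, `small_decoder`, and `target` are hypothetical, and the dictionary is a toy orthonormal one so that recovery is exact. Real SAE decoders are not orthonormal.

```python
import numpy as np

def omp(dictionary, target, k, tol=1e-10):
    """Greedy k-sparse regression of `target` onto the rows of `dictionary`.

    dictionary: (n_atoms, d_model) array, e.g. a smaller SAE's decoder.
    target:     (d_model,) vector, e.g. one decoder vector of a larger SAE.
    Returns a length-n_atoms coefficient vector with at most k nonzeros.
    """
    residual = target.copy()
    support = []
    sol = np.zeros(0)
    for _ in range(k):
        if np.linalg.norm(residual) < tol:
            break
        # Pick the atom most correlated with what is still unexplained.
        scores = np.abs(dictionary @ residual)
        if support:
            scores[support] = -np.inf
        support.append(int(np.argmax(scores)))
        # Refit all selected coefficients jointly by least squares.
        atoms = dictionary[support].T
        sol, *_ = np.linalg.lstsq(atoms, target, rcond=None)
        residual = target - atoms @ sol
    coefs = np.zeros(dictionary.shape[0])
    coefs[support] = sol
    return coefs

# Toy check: 16 orthonormal atoms in 32 dimensions, target built from two of them.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(32, 16)))
small_decoder = q.T
target = 0.8 * small_decoder[3] + 0.5 * small_decoder[11]
coefs = omp(small_decoder, target, k=2)
```

Here `k` plays the role of a fixed per-vector L0. A BatchTopK meta-SAE instead controls the average L0 across a batch, which is the extra flexibility the reply refers to.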