Yeah, this does seem like it’s another good example of what I’m trying to gesture at. More generally, I think the embedding at layer 0 is a good place for thinking about the kind of structure that the superposition hypothesis is blind to. If the vocab size is smaller than the SAE dictionary size, an SAE is likely to get perfect reconstruction and L0=1 simply by learning the vocab_size embedding vectors as its dictionary. But those embeddings aren’t random! They have been carefully learned and contain lots of useful information. I think trying to explain the structure in the embeddings is a good testbed for explaining feature geometry in general.
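To make the "trivial solution" concrete, here's a minimal sketch (toy dimensions and variable names are my own, and the encoder is idealized as a lookup rather than something actually learned): if the dictionary is at least vocab_size wide, copying each token embedding into its own decoder column gives exact reconstruction of every layer-0 residual (ignoring positional terms) with exactly one active latent per token.

```python
import torch

vocab_size, d_model, dict_size = 1000, 64, 4096  # assumed toy sizes
W_E = torch.randn(vocab_size, d_model)           # stand-in for learned token embeddings

# "Trained" SAE decoder: the first vocab_size dictionary vectors are just the embeddings.
W_dec = torch.zeros(dict_size, d_model)
W_dec[:vocab_size] = W_E

def encode(x, token_ids):
    # Idealized encoder: activate exactly the latent matching the token id,
    # with coefficient 1. (A real SAE would have to learn this lookup.)
    acts = torch.zeros(x.shape[0], dict_size)
    acts[torch.arange(x.shape[0]), token_ids] = 1.0
    return acts

token_ids = torch.randint(0, vocab_size, (32,))
x = W_E[token_ids]                  # layer-0 embeddings for a batch of tokens
acts = encode(x, token_ids)
x_hat = acts @ W_dec                # reconstruction from the dictionary
print((x - x_hat).abs().max())      # ~0: perfect reconstruction
print((acts != 0).sum(dim=-1))      # L0 = 1 for every token
```

The point being: the SAE objective is satisfied without the dictionary telling us anything about *why* the embeddings are arranged the way they are.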