Yeah, this does seem like it’s another good example of what I’m trying to gesture at. More generally, I think the embedding at layer 0 is a good place for thinking about the kind of structure that the superposition hypothesis is blind to. If the vocab size is smaller than the SAE dictionary size, an SAE is likely to get perfect reconstruction and L0=1 simply by learning the vocab_size embedding vectors as its dictionary. But those embeddings aren’t random! They have been carefully learned and contain lots of useful information. I think trying to explain the structure in the embeddings is a good testbed for explaining feature geometry in general.
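To make the "trivial solution" concrete, here's a minimal sketch (toy dimensions and variable names are my own, and the encoder is idealized as a lookup rather than something actually learned): if the dictionary is at least vocab_size wide, copying each token embedding into its own decoder column gives exact reconstruction of every layer-0 residual (ignoring positional terms) with exactly one active latent per token.

```python
import torch

vocab_size, d_model, dict_size = 1000, 64, 4096  # assumed toy sizes
W_E = torch.randn(vocab_size, d_model)           # stand-in for learned token embeddings

# "Trained" SAE decoder: the first vocab_size dictionary vectors are just the embeddings.
W_dec = torch.zeros(dict_size, d_model)
W_dec[:vocab_size] = W_E

def encode(x, token_ids):
    # Idealized encoder: activate exactly the latent matching the token id,
    # with coefficient 1. (A real SAE would have to learn this lookup.)
    acts = torch.zeros(x.shape[0], dict_size)
    acts[torch.arange(x.shape[0]), token_ids] = 1.0
    return acts

token_ids = torch.randint(0, vocab_size, (32,))
x = W_E[token_ids]                  # layer-0 embeddings for a batch of tokens
acts = encode(x, token_ids)
x_hat = acts @ W_dec                # reconstruction from the dictionary
print((x - x_hat).abs().max())      # ~0: perfect reconstruction
print((acts != 0).sum(dim=-1))      # L0 = 1 for every token
```

The point being: the SAE objective is satisfied without the dictionary telling us anything about *why* the embeddings are arranged the way they are.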