This reminded me of how GPT-2-small uses a cosine/sine spiral for its learned positional embeddings, and I don’t think I’ve seen a mechanistic/dynamical explanation for this (just the post-hoc explanation that attention can use cosine similarity to encode distance in R^n, not an explanation of why it should happen this way).
Yeah, this does seem like it’s another good example of what I’m trying to gesture at. More generally, I think the embedding at layer 0 is a good place for thinking about the kind of structure that the superposition hypothesis is blind to. If the vocab size is smaller than the SAE dictionary size, an SAE is likely to get perfect reconstruction and L0=1 by just learning the vocab_size-many embeddings as its dictionary. But those embeddings aren’t random! They have been carefully learned and contain lots of useful information. I think trying to explain the structure in the embeddings is a good testbed for explaining general feature geometry.
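To make the degenerate solution concrete, here is a minimal sketch of a "lookup-table" SAE. It rests on several assumptions that are not from the discussion above: a top-1 activation (as in a TopK SAE with k=1) instead of the usual ReLU-plus-L1 setup, a random matrix standing in for the real embedding matrix, and made-up toy sizes. The names `W_E`, `W_dec`, and `encode` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; the only requirement is dict_size >= vocab_size.
vocab_size, d_model, dict_size = 50, 16, 64

# Stand-in for a learned embedding matrix (random here, purely for illustration).
W_E = rng.normal(size=(vocab_size, d_model))

# Decoder: one unit-norm direction per token; leftover dictionary rows stay zero.
W_dec = np.zeros((dict_size, d_model))
W_dec[:vocab_size] = W_E / np.linalg.norm(W_E, axis=1, keepdims=True)


def encode(x):
    """Top-1 encoder (assumed): keep only the best-matching latent per input."""
    scores = x @ W_dec.T                      # (batch, dict_size)
    rows = np.arange(x.shape[0])
    best = scores.argmax(axis=1)
    acts = np.zeros_like(scores)
    acts[rows, best] = scores[rows, best]     # activation equals the embedding norm
    return acts


acts = encode(W_E)
recon = acts @ W_dec

l0 = (acts != 0).sum(axis=1)                  # active latents per token
max_err = np.abs(recon - W_E).max()           # worst-case reconstruction error

# The dictionary ignores the geometry between tokens, e.g. this Gram matrix.
gram = W_E @ W_E.T
```

Under these assumptions every token activates exactly one latent and is reconstructed up to floating-point error, yet nothing in the dictionary says anything about `gram`, i.e. how the embeddings sit relative to one another. That relational structure is the part I’d want an explanation of.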