I think this is most of what the layer 0 SAE gets wrong. The layer 0 SAE just reconstructs the activations after embedding (positional + token), so the only real explanation I see for what it’s getting wrong is the positional embedding.
But I’m less convinced that this explains the later-layer SAEs. If you look at, e.g., this figure:
then you see that the layer 0 model activations are an order of magnitude smaller than any later-layer activations. The positional embedding is therefore only a small part of the signal going into the SAE for any layer > 0, so I’m skeptical that it accounts for a large fraction of the large MSE that shows up there.
Regardless, this seems like a really valuable test! It would be fun to see what happens if you feed just the token embedding into the SAE and then add the positional embedding back in after reconstructing the token embedding. I’d naively assume that this would go poorly: if the SAE for layer 0 learns concepts more complex than just individual token embeddings, I think that would have to be the result of mixing positional and token embeddings?
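To make the proposed test concrete, here is a minimal sketch of the comparison. Everything in it is an assumption for illustration: the SAE is a plain ReLU autoencoder with random, untrained weights, the embeddings are random arrays, and the names `sae_reconstruct` and `compare_reconstructions` are hypothetical. With real layer 0 SAE weights and real model embeddings swapped in, the two MSE values would be the quantities to compare; the numbers from this toy setup mean nothing on their own.

```python
import numpy as np

def sae_reconstruct(x, W_enc, b_enc, W_dec, b_dec):
    """Assumed ReLU SAE forward pass: encode to sparse features, then decode."""
    feats = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    return feats @ W_dec + b_dec

def compare_reconstructions(tok_emb, pos_emb, sae_params):
    """Compare the usual layer 0 reconstruction with the token-only variant."""
    target = tok_emb + pos_emb  # what the layer 0 SAE normally sees

    # (a) Usual setup: reconstruct token + positional embedding together.
    full = sae_reconstruct(target, *sae_params)

    # (b) Proposed test: reconstruct the token embedding alone,
    #     then add the positional embedding back afterwards.
    tok_only = sae_reconstruct(tok_emb, *sae_params) + pos_emb

    def mse(a, b):
        return float(np.mean((a - b) ** 2))

    return {
        "full_input_mse": mse(full, target),
        "token_only_mse": mse(tok_only, target),
    }

# Toy stand-ins (random, NOT real model or SAE weights).
rng = np.random.default_rng(0)
seq_len, d_model, d_sae = 16, 32, 128
tok_emb = rng.normal(size=(seq_len, d_model))
pos_emb = 0.1 * rng.normal(size=(seq_len, d_model))
sae_params = (
    rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model),  # W_enc
    np.zeros(d_sae),                                       # b_enc
    rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae),    # W_dec
    np.zeros(d_model),                                     # b_dec
)

result = compare_reconstructions(tok_emb, pos_emb, sae_params)
print(result)
```

If the token-only MSE came out much lower than the full-input MSE on a real SAE, that would support the positional-embedding explanation; if it came out higher, that would fit the guess that the SAE's features depend on mixing the two embeddings.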
My mental model is that the encoder is working hard to find particular features and distinguish them from others (so it’s doing a compressed sensing task), and that out of context it’s off distribution and therefore doesn’t distinguish noise properly. Positional features are likely a part of that, but I’d be surprised if they were most of it.