I really appreciate this work!
I wonder if the reason MLPs are more polysemantic isn’t because there are fewer MLPs than heads but because the MLP matrices are larger:
Suppose the model is storing information as sparse [rays or directions]. Then SVD on large matrices like the token embeddings can misunderstand the model in different ways:
- Many of the sparse rays/directions won’t be picked up by SVD. If there are 10,000 rays/directions used by the model and the model dimension is 768, SVD can only pick 768 directions.
- If the model natively stores information as rays, then SVD is looking for the wrong thing: directions instead of rays. If you think of SVD as a greedy search for the most important directions, the error might increase as the importance of the direction decreases.
- Because the model is storing things sparsely, it can squeeze in far more meaningful directions than the model dimension. But these directions can’t be perfectly orthogonal; they have to interfere with each other at least a bit. This noise could make SVD on large matrices worse and also means that the assumptions involved in SVD are wrong. (A small sketch of the counting and interference points follows this list.)
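To make those two points concrete, here is a minimal numpy sketch (not from the original comment; the 10,000 random unit vectors are only a stand-in for whatever sparse features the model actually uses): stacking more features than dimensions gives SVD at most 768 directions to return, and the features are forced to overlap.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 768, 10_000

# Stand-in for an overcomplete set of sparse feature directions:
# far more unit vectors than the model has dimensions.
features = rng.standard_normal((n_features, d_model))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# SVD of the stacked features returns at most min(n_features, d_model) = 768
# singular directions, so 10,000 features cannot each get their own direction.
_, S, _ = np.linalg.svd(features, full_matrices=False)
print(S.shape)  # (768,)

# The features also cannot be mutually orthogonal: even within a small sample,
# the largest off-diagonal cosine similarity is clearly above zero (interference).
sample = features[:1000]
gram = sample @ sample.T
np.fill_diagonal(gram, 0.0)
print(np.abs(gram).max())
```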
As evidence for the above story, I notice that the earliest PCA directions on the token embeddings are interpretable, but they quickly become less interpretable?
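One rough way to eyeball this (a sketch assuming GPT-2 small and the Hugging Face transformers package, neither of which is specified above) is to mean-center the token embedding matrix, take its SVD, and print the tokens that project most strongly onto each of the first few directions:

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# Token embedding matrix, mean-centered so the SVD directions match PCA directions.
E = model.wte.weight.detach()            # shape (50257, 768)
E_centered = E - E.mean(dim=0)
_, _, Vh = torch.linalg.svd(E_centered, full_matrices=False)

# For each of the first few principal directions, print the tokens whose
# embeddings project most strongly onto that direction.
for i in range(8):
    scores = E_centered @ Vh[i]
    top = torch.topk(scores, 10).indices
    print(i, tokenizer.convert_ids_to_tokens(top.tolist()))
```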
Maybe because the QK/OV matrices have low rank, they specialize in a small number of the sparse directions (possibly more than their rank) and pick up less interference noise. This could contribute to the interpretability of their SVD directions.
In this world you might expect the QK/OV SVD directions to be more interpretable than those of the MLP matrices, which would in turn be more interpretable than the token embedding SVD directions.
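For what it’s worth, the rank bound behind the low-rank point above is easy to see with random stand-in matrices (the 768/64 shapes are GPT-2-small-style assumptions on my part, not something from the post): the OV circuit is a full d_model × d_model matrix, but it only has d_head nontrivial singular directions to spend.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 768, 64

# Hypothetical per-head value and output projections (random stand-ins).
W_V = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
W_O = rng.standard_normal((d_head, d_model)) / np.sqrt(d_head)

# The OV circuit W_V @ W_O is 768 x 768, but its rank is at most d_head = 64,
# so its SVD has only 64 nontrivial directions to allocate across features.
OV = W_V @ W_O
S = np.linalg.svd(OV, compute_uv=False)
print((S > 1e-10 * S[0]).sum())  # 64
```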
This seems like an important point, but I am not sure I completely follow. How do rays differ from directions here? I agree that the SVD directions won’t recover any JL-style dense packing of directions, since SVD is constrained to at most the dimension of the matrix. The thinking here is that if the model tends to pack semantically similar directions into closely related dimensions, then the SVD would pick up on at least an average of this and represent it.
Something else to keep in mind is that we are doing the SVDs over the OV and MLP weights, not the activations. That is, these are the directions in which the weight matrix is most strongly stretching the activation space. We don’t necessarily expect the weight matrix to be doing its own JL packing. I also think it is reasonable that the SVD would find sensible directions here. It is of course possible that the network isn’t relying on the principal SVD directions for its true ‘semantic’ processing, but instead performs the stretching/compressing along some intermediate direction composed of multiple SVD directions, and we can’t rule that out with this method.
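To spell out “the directions in which the weight matrix is most strongly stretching the activation space”: these are the right singular vectors, and the top one is the unit input direction whose image under the weight matrix has the largest norm. A minimal sketch with a random stand-in weight matrix (the 3072 × 768 MLP-style shape is my assumption, not anything from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3072, 768))   # stand-in for an MLP input weight matrix

_, S, Vt = np.linalg.svd(W, full_matrices=False)
v1 = Vt[0]                             # top right singular vector, an input direction

# The top right singular direction is the unit input that W stretches the most:
# ||W @ v1|| equals the top singular value.
print(np.linalg.norm(W @ v1), S[0])    # these two numbers agree

# A random unit input direction gets stretched noticeably less.
x = rng.standard_normal(768)
x /= np.linalg.norm(x)
print(np.linalg.norm(W @ x))
```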