I think you probably could do that, but you’d be restricting yourself to something that might work marginally worse than whatever would otherwise be found by gradient descent. Also, the more important part of the 768 dimensional vector which actually gets processed is the token embeddings.
If you believe that neural nets store things as directions, one way to think of this is as the neural net reserving 3 dimensions for positional information, and 765 for the semantic content of the tokens. If the actual meaning of the words you read is roughly 250 times as important to your interpretation of a sentence as where they come in a sentence, then this should make sense?
This is kind of a silly way of looking at it (we don't have any reason, that I'm aware of, to think of these as separable, and the interactions probably matter a lot), but it might be not-totally-worthless as intuition.
@AdamYedidia This is super cool stuff! Is the magnitude of the token embeddings at all concentrated in or out of the 3 PCA dimensions for the positional embeddings? If it's concentrated away from that, we are practically using the addition as a direct sum, which is nifty.
It is in fact concentrated away from that, as you predicted! Here’s a cool scatter plot:
The blue points are the positional embeddings for gpt2-small, whereas the red points are the token embeddings.
If you want to play around with it yourself, you can find it in the experiments/ directory of the following GitHub repo: https://github.com/adamyedidia/resid_viewer.
You can skip most of the setup in the README if you just want to reproduce the experiment (there's a lot of other stuff going on in the repository), but you'll still need to install TransformerLens, sklearn, numpy, etc.
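For anyone who wants the gist without digging through the repo, here's a minimal sketch of the kind of thing the experiment does, assuming TransformerLens, sklearn, numpy, and matplotlib. The exact plotting details (and whether the original scatter plot uses two or three of the PCA components) are my guesses, not necessarily what the repo's code produces:

```python
# Sketch: fit a 3-component PCA on gpt2-small's positional embeddings,
# project the token embeddings onto the same components, and compare.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # gpt2-small

pos = model.W_pos.detach().cpu().numpy()  # [1024, 768] positional embeddings
tok = model.W_E.detach().cpu().numpy()    # [50257, 768] token embeddings

# Fit PCA on the positional embeddings, then project the token
# embeddings onto those same 3 components.
pca = PCA(n_components=3)
pos_3d = pca.fit_transform(pos)
tok_3d = pca.transform(tok)

# Scatter the first two components (blue = positional, red = token).
plt.scatter(tok_3d[:, 0], tok_3d[:, 1], s=1, c="red", label="token embeddings")
plt.scatter(pos_3d[:, 0], pos_3d[:, 1], s=1, c="blue", label="positional embeddings")
plt.legend()
plt.show()

# Rough check of the "direct sum" question: what fraction of each token
# embedding's squared norm lies in the 3-dim subspace spanned by the
# positional PCA components?
components = pca.components_   # (3, 768), orthonormal rows
tok_proj = tok @ components.T  # coordinates in that subspace
frac = (tok_proj ** 2).sum(axis=1) / (tok ** 2).sum(axis=1)
print(f"mean fraction of token-embedding norm in positional subspace: {frac.mean():.4f}")
```

One caveat: sklearn's `transform` centers the token embeddings using the mean of the positional embeddings it was fit on, so the scatter and the norm-fraction check are answering slightly different questions; the repo's actual code may handle centering differently.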