I’m pretty confused; this doesn’t seem to happen for any other models, and I can’t think of a great explanation.
Has anyone investigated this further?
Here are graphs I made for GPT-2, Mistral 7B, and Pythia 14M.
Three dimensions do indeed explain almost all of the variance in GPT-2’s positional embeddings, whereas Mistral 7B and Pythia 14M both seem to make use of all of their dimensions.
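In case it helps, here’s roughly the kind of check I mean for the GPT-2 case: a minimal sketch assuming the Hugging Face `transformers` and `scikit-learn` packages, with PCA explained variance as the notion of “dimensions explaining the information” (variable names are just illustrative).

```python
# Rough sketch: PCA over GPT-2's learned positional embedding matrix,
# then look at the cumulative explained variance of the top components.
# Assumes Hugging Face `transformers` and `scikit-learn` are installed.
import numpy as np
from sklearn.decomposition import PCA
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
# GPT-2 stores its learned absolute positional embeddings in `wpe`
# (shape [n_positions, n_embd] = [1024, 768]).
pos_emb = model.wpe.weight.detach().cpu().numpy()

pca = PCA().fit(pos_emb)
cumvar = np.cumsum(pca.explained_variance_ratio_)
print("variance explained by the first 3 components:", cumvar[2])
```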
Mistral and Pythia use rotary embeddings and don’t have a positional embedding matrix. Which matrix are you looking at for those two models?
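One quick way to see this is to list the position-related parameters and buffers each model registers; the sketch below uses GPT-2 and Pythia 14M (Mistral 7B is large to load just for this), and names like `wpe` and `inv_freq` follow the Hugging Face implementations, which may differ across library versions.

```python
# Check which models have a learned positional-embedding matrix
# (a trainable parameter like GPT-2's `wpe`) versus only rotary
# embedding buffers (e.g. `inv_freq`). Attribute names are taken
# from the Hugging Face implementations and may vary by version.
from transformers import AutoModel

for name in ["gpt2", "EleutherAI/pythia-14m"]:
    model = AutoModel.from_pretrained(name)
    learned = [n for n, _ in model.named_parameters()
               if "wpe" in n or "embed_positions" in n]
    rotary = [n for n, _ in model.named_buffers() if "inv_freq" in n]
    print(f"{name}: learned pos-emb params={learned}, rotary buffers={rotary}")
```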
Oh shoot, yeah. I’m probably just looking at the rotary embeddings, then. Forgot about that, thanks!