I think it’s very cool to play with token embeddings in this way! Note that some of what you observe is, I think, a consequence of geometry in high dimensions and can be understood by just modeling token embeddings as random. I recommend generating a bunch of tokens as Gaussian random vectors in a high-dimensional space and playing around with their norms, and with their norms after adding a random offset.
Some things to keep in mind, which can be fun to check for some random vectors (see the numerical sketch after this list):
- radii of distributions in high-dimensional space tend to cluster around some fixed value. For a multivariate Gaussian in n-dimensional space, this is because the squared radius is a sum of squares of Gaussians (one for each coordinate). That sum is a random variable with mean O(n) and standard deviation O(√n). In your case you’re also taking a square root (norm vs. squared norm) and the normalization is different, but the general pattern of the radius concentrating in a narrow band (with width about 1/√n relative to the radius) will hold.
- a random offset vector will not change the overall behavior (though it will change the radius).
- two random vectors in high-dimensional space will be nearly orthogonal.
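Here is a minimal NumPy sketch of those three points, using an arbitrarily chosen dimension (768) and token count (50,000) that stand in for a real embedding matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_vectors = 768, 50_000  # arbitrary sizes chosen for illustration

# "Token embeddings" modeled as i.i.d. standard Gaussian vectors.
tokens = rng.standard_normal((num_vectors, n))

# 1. Norms concentrate: mean ~ sqrt(n), relative spread ~ 1/sqrt(n).
norms = np.linalg.norm(tokens, axis=1)
print(f"norms: mean={norms.mean():.2f}, std={norms.std():.2f}, "
      f"relative width={norms.std() / norms.mean():.4f}, 1/sqrt(n)={1 / np.sqrt(n):.4f}")

# 2. A fixed random offset shifts the typical radius, but norms still concentrate.
offset = rng.standard_normal(n) * 3.0
shifted_norms = np.linalg.norm(tokens - offset, axis=1)
print(f"shifted norms: mean={shifted_norms.mean():.2f}, "
      f"relative width={shifted_norms.std() / shifted_norms.mean():.4f}")

# 3. Two random vectors are nearly orthogonal: cosine similarity ~ O(1/sqrt(n)).
a, b = tokens[0], tokens[1]
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity of two random vectors: {cos:.4f}")
```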
On the other hand, it’s unexpected that the mean is so large: normally you would expect the mean of a bunch of random vectors to be much smaller than the vectors themselves. If this is not an artifact of training, it may indicate that words learn to be biased in some shared direction (maybe a direction indicating something like “a concept exists here”). The behavior of tokens near the center of mass also seems really interesting.
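To make that baseline expectation concrete: for N i.i.d. zero-mean random vectors, the norm of the mean is roughly 1/√N times a typical vector’s norm, so a center of mass comparable in size to the vectors themselves really would be surprising. A quick check under the same toy assumptions as the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_vectors = 768, 50_000  # same arbitrary sizes as above

tokens = rng.standard_normal((num_vectors, n))

# Norm of a typical vector vs. norm of the center of mass.
typical_norm = np.linalg.norm(tokens, axis=1).mean()
mean_norm = np.linalg.norm(tokens.mean(axis=0))
print(f"typical vector norm: {typical_norm:.2f}")
print(f"norm of the mean:    {mean_norm:.3f}")
print(f"expected ratio ~ 1/sqrt(num_vectors) = {1 / np.sqrt(num_vectors):.4f}, "
      f"observed = {mean_norm / typical_norm:.4f}")
```

If the mean of the real embeddings is much larger than this zero-mean baseline predicts, that is the kind of evidence that would point toward a learned shared bias direction rather than pure high-dimensional geometry.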