This looks like some kind of (rather bizarre) emergent/primitive ontology, radially stratified from the token embedding centroid.
A tentative thought on this… if we put our ‘superposition’ hats on.
We’re thinking of directions as mapping to concepts or abstractions or whatnot. But there are too few strictly-orthogonal directions, so we need to cram things in somehow. It’s fashionable (IIUC) to imagine this happening purely by direction, but some kind of space partitioning (accounting for magnitudes as well) seems plausible to me.
Maybe closer to the centroid, there’s ‘less room’ for complicated taxonomies, so there are just some kinda ‘primitive’ abstractions which don’t have much refinement (perhaps at further distances there are taxonomic refinements of ‘metal’ and ‘sharp’). Then, the nearest conceptual-neighbour of small-magnitude random samples might tend to be one of these relatively ‘primitive’ concepts?
This might go some way to explaining why at close-to-centroid you’re getting these clustered ‘primitive’ concepts.
The ‘space partitioning’ vs ‘direction-based splitting’ framing could also explain the large-magnitude clusters (though it’s less clear why they’d be ‘primitive’). Clearly there’s some pressure (explicit regularisation or otherwise) for most embeddings to sit in a particular shell. Taking that as given, there’s then little training pressure to finely partition the space ‘far outside’ that shell. So it maybe just happens to map to a relatively small number of concepts whose regions include the more outward reaches of the shell.
How to validate this sort of hypothesis? I’m not sure. It might be interesting to look for centroids, nearest neighbours, or something, of the apparent conceptual clusters that come out here. Or you could pay particular attention to the tokens with smallest and largest distance-to-centroid (there were long tails there).
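For what it’s worth, here’s a minimal sketch of that last check, assuming you can pull the token-embedding matrix out of the model as a NumPy array (the function and variable names here are mine and purely illustrative, not anything from the original post):

```python
import numpy as np

def centroid_extremes(E, tokens, k=20):
    """For an embedding matrix E of shape (vocab_size, d) and a parallel list of
    token strings, return the k tokens nearest to and farthest from the centroid."""
    centroid = E.mean(axis=0)                      # mean token embedding
    dists = np.linalg.norm(E - centroid, axis=1)   # distance-to-centroid per token
    order = np.argsort(dists)
    nearest  = [(tokens[i], float(dists[i])) for i in order[:k]]
    farthest = [(tokens[i], float(dists[i])) for i in order[::-1][:k]]
    return nearest, farthest
```

Eyeballing the two tails (and a histogram of `dists`) would also show how sharply peaked the ‘shell’ in the hypothesis above actually is.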
You said “there are too few strictly-orthogonal directions, so we need to cram things in somehow.”
I don’t think that’s true. That is a low-dimensional intuition that does not translate to high dimensions. It may be “strictly” true if you want the vectors to be exactly orthogonal, but such perfect orthogonality is unnecessary. See e.g. papers that discuss “the linearity hypothesis” in deep learning.
As a previous poster pointed out (and as Richard Hamming pointed out long ago) “almost any pair of random vectors in high-dimensional space are almost-orthogonal.” And almost orthogonal is good enough.
(When we say “random vectors in high-dimensional space” we mean they can be drawn from any distribution roughly centered at the origin: uniformly in a hyperball, uniformly from the surface of a hypersphere, uniformly in a hypercube, random vertices of a hypercube, a multivariate Gaussian, or a convex hyper-potato...)
You can check this numerically, and prove it analytically for many well-behaved distributions.
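Here is roughly what that numerical check might look like (the dimension, sample count, and sampler names are my own choices, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, pairs = 1024, 2000   # dimension, number of random pairs per distribution

def cosine_stats(sample):
    """Draw two batches of random vectors and return mean/std of their pairwise cosines."""
    u, v = sample(), sample()
    sims = np.sum(u * v, axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    return sims.mean(), sims.std()

samplers = {
    "gaussian":        lambda: rng.standard_normal((pairs, d)),
    "cube vertices":   lambda: rng.choice([-1.0, 1.0], size=(pairs, d)),
    "uniform in cube": lambda: rng.uniform(-1.0, 1.0, size=(pairs, d)),
}

for name, sample in samplers.items():
    mean, std = cosine_stats(sample)
    print(f"{name:>15}: mean cosine {mean:+.4f}, std {std:.4f}  (1/sqrt(d) = {1/np.sqrt(d):.4f})")
```

In each case the pairwise cosine similarities cluster tightly around zero, with spread on the order of 1/√d.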
One useful thought experiment is to consider the hypercube centered at the origin whose vertex coordinates are all ±1. A random hypercube vertex is then a long random vector of the form (±1, ±1, …, ±1), where each coordinate is +1 or −1 with probability 50% each.
What is the expected value of the dot product of a pair of such random (vertex) vectors? It is zero, and moreover the dot product is almost always tiny relative to the vectors’ lengths, so almost every such pair is almost orthogonal.
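Spelling that out (a standard calculation, not something from the original comment): for two such vertex vectors $u, v \in \{\pm 1\}^n$ with independent fair-coin coordinates,

$$
\mathbb{E}[u \cdot v] = \sum_{i=1}^{n} \mathbb{E}[u_i v_i] = 0,
\qquad
\operatorname{Var}(u \cdot v) = \sum_{i=1}^{n} \operatorname{Var}(u_i v_i) = n,
$$

so the dot product is typically of size $\sqrt{n}$, while $\lVert u \rVert \, \lVert v \rVert = n$. The cosine similarity of a random pair of vertices is therefore of order $1/\sqrt{n}$, which shrinks as the dimension grows.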
There are exponentially many almost-orthogonal directions in high dimensions. The hypercube vertices are just an easy example to work out analytically, but the same phenomenon occurs for many distributions, particularly hyperballs, hyperspheres, and Gaussians.
The hypercube example above, BTW, corresponds to one-bit quantization of the embedding-space dimensions. It often works surprisingly well (see also “locality-sensitive hashing”).
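As a toy illustration of that one-bit quantization idea (the data here are synthetic and the names are mine, not anything from the thread):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 512, 2000
X = rng.standard_normal((n, d))                  # stand-in "embeddings"
q = rng.standard_normal(d)                       # a query vector
X[:20] = q + 0.5 * rng.standard_normal((20, d))  # plant some genuine near-neighbours of q

bits_X = X > 0                                   # one-bit quantization: keep only the signs
bits_q = q > 0

cos = X @ q / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
ham = (bits_X == bits_q).mean(axis=1)            # fraction of matching sign bits

top_cos = set(np.argsort(-cos)[:20])
top_ham = set(np.argsort(-ham)[:20])
print("overlap of top-20 neighbour sets:", len(top_cos & top_ham), "/ 20")
```

Ranking neighbours by the fraction of matching sign bits recovers essentially the same top neighbours as full-precision cosine similarity here, which is the basic idea behind sign-based locality-sensitive hashing.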
This point that Hamming made (and he was probably not the first) lies close to the heart of all embedding-space-based learning systems.