The model embeds each token as a vector. The vector space has fewer than 50k dimensions, while the vocabulary has roughly 50k tokens, so the embeddings can’t all be orthogonal and some will overlap with others to varying extents.
Usually, training keeps token embeddings from being too close to each other, but for rare enough tokens the model doesn’t have much reason to care. So my bet is that “distribute” has the closest vector to “SolidGoldMagikarp”, and that either the “distribute” vector has the larger norm, or the model has separately learned to map the “SolidGoldMagikarp” vector (and therefore similar vectors) to “distribute” on the output side.
This is sort of a smooth, continuous version of a collision-oblivious hashtable. One difference is that the model doesn’t mistake “SolidGoldMagikarp” for “distribute” 100% of the time: once or twice it’s said “disperse” instead.
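For anyone who hasn't met the analogy, here is a rough sketch of what I mean by a collision-oblivious hashtable: it stores values by slot only and never checks which key put them there. The class name `ObliviousTable` and the toy hash (sum of character codes) are my own choices for illustration, not anything from a real library.

```python
class ObliviousTable:
    """Hash table that never stores keys, so it cannot notice collisions."""

    def __init__(self, n_slots):
        self.slots = [None] * n_slots

    def _slot(self, key):
        # Deliberately weak, deterministic toy hash: anagrams always collide.
        return sum(ord(c) for c in key) % len(self.slots)

    def put(self, key, value):
        self.slots[self._slot(key)] = value

    def get(self, key):
        # No key comparison: whatever occupies the slot is returned.
        return self.slots[self._slot(key)]

table = ObliviousTable(8)
table.put("ab", "distribute")

# "ba" was never inserted, but it hashes to the same slot as "ab".
result = table.get("ba")
print(result)
```

The discrete table returns the same wrong answer every time. The embedding version is softer, which fits with the occasional “disperse”.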
My post on GPT-2’s token embeddings looks briefly at a similar phenomenon with some other rare tokens, but I didn’t check the actual model behavior on those tokens. Probably worth doing.