There were similar results in the mid 2010s showing that the principal components of word vectors like Word2Vec or GloVe mainly encoded frequency, and that the vectors could be improved by making them more isotropic (see for example the slides here). It’s somewhat interesting that this issue persists in the learned embeddings of current Transformer models.
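A minimal sketch of that kind of isotropy post-processing (in the spirit of the "all-but-the-top" approach, not necessarily the exact method from the slides): center the embedding matrix, then subtract the projection onto its top principal components, which tend to carry frequency rather than meaning. Names and the choice of k here are illustrative.

```python
import numpy as np

def remove_top_components(embeddings, k=2):
    """Center the embeddings, then project out the top-k principal
    components (assumed to mostly encode word frequency)."""
    centered = embeddings - embeddings.mean(axis=0)
    # Right singular vectors of the centered matrix are its principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k]                              # (k, dim)
    return centered - centered @ top.T @ top  # remove projection onto top-k

# Toy example: random stand-ins for 100 word vectors in 50 dimensions.
rng = np.random.default_rng(0)
vecs = remove_top_components(rng.normal(size=(100, 50)), k=2)
```

After this step the vectors have zero component along the former dominant directions, which is what makes the resulting embedding space more isotropic.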
Huh, interesting!