I don’t think we should consider the centroid important in describing the LLM’s “ontology”. In my view, the centroid just points in the direction of highest density of words in the LLM’s space of concepts. Let me explain:
The reason that embeddings are spread out is to allow the model to distinguish between words. So intuitively, tokens with largeish dot product between them correspond to similar words. Distinguishability of tokens is a limited resource, so the training process should generally result in a distribution of tokens that uses this resource in an efficient way to encode the information needed to predict text. Consider a language with 100 words for snow. Probably these all end up with similar token vectors, with large dot products between them. Exactly which word for snow someone writes is probably not too important for predicting text. So the training process makes those tokens relatively less distinguishable from each other. But the fact that there are a 100 tokens all pointing in a similar direction means that the centroid gets shifted in that direction.
Probably you can see where this is going now. The centroid gets shifted in directions where there are many tokens that the network considers to be all similar in meaning, directions where human language has allocated a lot of words, while the network considers the differences in shades of meaning between these words to be relatively minor.
I don’t think we should consider the centroid important in describing the LLM’s “ontology”. In my view, the centroid just points in the direction of highest density of words in the LLM’s space of concepts. Let me explain:
The reason that embeddings are spread out is to allow the model to distinguish between words. So intuitively, tokens with largeish dot product between them correspond to similar words. Distinguishability of tokens is a limited resource, so the training process should generally result in a distribution of tokens that uses this resource in an efficient way to encode the information needed to predict text. Consider a language with 100 words for snow. Probably these all end up with similar token vectors, with large dot products between them. Exactly which word for snow someone writes is probably not too important for predicting text. So the training process makes those tokens relatively less distinguishable from each other. But the fact that there are a 100 tokens all pointing in a similar direction means that the centroid gets shifted in that direction.
Probably you can see where this is going now. The centroid gets shifted in directions where there are many tokens that the network considers to be all similar in meaning, directions where human language has allocated a lot of words, while the network considers the differences in shades of meaning between these words to be relatively minor.
If there’s a 100 tokens for snow, it probably indicates that it’s a particularly important concept for that language.