Did you try getting the centroid of all words, rather than all tokens? The set of tokens will contain a lot of nonsense fragments.
No, but it would be interesting to try. Someone somewhere might have compiled a list of indexes for GPT-2/3/J tokens which are full words, but I've not yet been able to find one.
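In the meantime, here's a minimal sketch of one way to approximate it, assuming the Hugging Face transformers GPT-2 checkpoint. The "full word" heuristic here (a token whose decoded form is purely alphabetic after stripping GPT-2's leading-space marker "Ġ") is my own rough cut, not a vetted word list, so it will miss multi-token words and let through some fragments that happen to spell words:

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

with torch.no_grad():
    # Token embedding matrix, shape (vocab_size, d_model) = (50257, 768)
    emb = model.transformer.wte.weight

    # Rough filter: keep tokens that are alphabetic once GPT-2's
    # leading-space marker "Ġ" is stripped. A proper version would
    # intersect with an English dictionary instead.
    word_ids = [
        idx for tok, idx in tokenizer.get_vocab().items()
        if tok.lstrip("Ġ").isalpha()
    ]

    word_centroid = emb[word_ids].mean(dim=0)
    all_centroid = emb.mean(dim=0)

print(f"{len(word_ids)} word-like tokens out of {emb.shape[0]}")
print("distance between centroids:",
      torch.dist(word_centroid, all_centroid).item())
```

Comparing the two centroids (and nearest tokens to each) would show how much the nonsense fragments are actually skewing things.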