LLMs appear to universally learn a feature in their embeddings representing the frequency / rarity of the tokens they were trained on
...I use Olah et al.’s definition of a feature as a direction in the vector space of a model’s weights / activations in a given layer.
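For concreteness, here is a minimal sketch of how one might look for such a frequency direction under that definition: fit a linear probe from the input-embedding matrix to log token counts and treat the probe weights as the candidate feature. This is my illustration, not the post's actual code; `model` (a HuggingFace-style causal LM) and `token_counts` (per-token counts from the training corpus) are assumed stand-ins.

```python
# Hypothetical sketch: probe the input embeddings for a token-frequency direction.
# `model` and `token_counts` are assumed, not taken from the original post.
import numpy as np
from sklearn.linear_model import LinearRegression

emb = model.get_input_embeddings().weight.detach().cpu().numpy()  # (vocab_size, d_model)
log_freq = np.log1p(token_counts)                                 # log(1 + count) to handle zeros

probe = LinearRegression().fit(emb, log_freq)
freq_direction = probe.coef_ / np.linalg.norm(probe.coef_)        # unit-norm "frequency" direction

# If the feature exists, projecting each embedding onto this direction should
# track how often that token was seen during training.
projections = emb @ freq_direction
print("correlation with log frequency:", np.corrcoef(projections, log_freq)[0, 1])
```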
If you define ‘feature’ this way, and you look only at post-training models, including your nanoGPT models, is the rarity feature necessarily learned? Or is it something else, like being left near its initialization while every other feature is learned, so that the ‘rareness’ feature is not really a feature so much as ‘the absence of every other feature’, maybe?
It seems like, almost by definition from gradient descent, if tokens are rarely or never present in the training data (as seems to be the case with a lot of the glitch tokens like SolidGoldMagikarp, especially the spam ones, like all the Chinese porn ones that somehow got included in the new GPT-4o tokenizer), it is difficult for a model to learn anything about them: they simply never contribute to the output because they are never present, and so there are no gradients. And then activation steering or anything else using the ‘rareness feature’ would be weak, or wouldn’t do anything particularly consistent or interesting, because what is left is effectively meaningless jitter from tiny numerical residues.
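To illustrate the gradient point with a toy example of my own (assuming an untied input embedding, which is the simplest case): an embedding row only receives gradient when its token actually appears in a batch, so a never-seen token's row just sits at its random initialization.

```python
# Toy demonstration (assumed setup, untied embeddings): only embedding rows for
# tokens present in the batch receive gradient; unseen rows stay at initialization.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 10, 8
emb = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size)

batch = torch.tensor([[1, 2, 3, 2, 4]])                  # token 9 never appears
inputs, targets = batch[:, :-1], batch[:, 1:]            # next-token prediction
logits = head(emb(inputs))
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()

print(emb.weight.grad[2].abs().sum())                    # nonzero: token 2 was in the batch
print(emb.weight.grad[9].abs().sum())                    # exactly zero: token 9 never contributed
# Caveats: weight decay still shrinks unseen rows over training, and with tied
# embeddings the softmax denominator does nudge every token's logit slightly.
```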
Very true. If a token truly never appears in the training data, its embedding wouldn’t be trained or learned at all. Similarly, if it’s only seen once or twice, it ends up “undertrained” and the token frequency feature doesn’t perform as well on it. The two least frequent tokens in the nanoGPT model are an excellent example of this: they appear only once or twice, never get properly learned, and end up as big outliers.
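Continuing the hypothetical probe sketch above (reusing the assumed `probe`, `emb`, `log_freq`, `token_counts`, and a hypothetical `tokenizer`), one rough way to surface that kind of undertrained outlier is to look at the tokens where the probe's predicted log frequency disagrees most with the observed count:

```python
# Large residuals flag tokens whose embeddings don't carry the frequency
# information the probe would predict, i.e. likely undertrained tokens.
residuals = np.abs(probe.predict(emb) - log_freq)

for tok_id in np.argsort(residuals)[-5:][::-1]:          # five biggest outliers
    print(tok_id, repr(tokenizer.decode([tok_id])),
          int(token_counts[tok_id]), float(residuals[tok_id]))
```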