This is great work to pursue in order to establish how consistent the glitch-token phenomenon is. It will be interesting to see whether such glitch-tokens arise in later LLMs now that developers have some theory of what might be giving rise to them (frequent strings that get learned by the tokenizer but are then filtered out of the training data, depriving the LLM of opportunities to learn about those tokens).
Also, it will be interesting once we are able to run k-means clustering on GPT-3.5/4's cl100k_base token base. While the hunch of searching towards the end of the token set makes sense as a heuristic, I'd bet that we are missing a lot of glitch-tokens, and possibly ones that are even more bizarre/ominous. Consider that some of the weirdest glitch-tokens in the GPT-2/3 token base don't necessarily come from towards the end of the token list. “ petertodd”, for example, is token #37444, only about 75% of the way through the token list.
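For concreteness, here is a minimal sketch of what that clustering pass might look like, run against GPT-2's openly available embedding matrix (cl100k_base embeddings for GPT-3.5/4 aren't public, so this only illustrates the approach). The cluster count and the "look at the cluster nearest the global centroid" heuristic are my own assumptions for the sketch, not a reproduction of the original authors' exact method:

```python
# Sketch: cluster token embeddings and inspect the cluster nearest the
# global mean embedding, where anomalous/under-trained tokens tend to sit.
# Assumptions: GPT-2 as a stand-in model, k=100 clusters chosen arbitrarily.
import numpy as np
from sklearn.cluster import KMeans
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# Token embedding matrix: shape (50257, 768) for GPT-2.
emb = model.wte.weight.detach().numpy()

# K-means over all token embeddings.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(emb)

# Rank clusters by distance of their centers from the global mean embedding,
# then print members of the closest cluster for manual inspection.
global_mean = emb.mean(axis=0)
dists = np.linalg.norm(kmeans.cluster_centers_ - global_mean, axis=1)
suspect_cluster = np.argsort(dists)[0]
for tok_id in np.where(kmeans.labels_ == suspect_cluster)[0][:50]:
    print(tok_id, repr(tokenizer.decode([tok_id])))
```

The same scan over the full cl100k_base embedding matrix, once available, would not depend on any assumption about where in the token ordering glitch-tokens live.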