There is likely a reason for this: if you feed numbers you found on the internet into an LLM digit by digit, it's going to destroy the embeddings of those numbers. A lot of things found in scrapes are just… extremely long sequences of numbers. The tradeoff may be numeracy (can do basic multiplication) vs. natural language performance (won't start spitting out Minecraft debug logs in the middle of a conversation).
Right, but… it’s a combined issue of tokenization and data cleaning. I agree you can’t properly do one without the other. I just think you should do both!
Clearly the root cause is the messy data, and once you’ve cleaned the data it should be trivial to fix the tokenizer.
I checked Lunary’s predicted tokenization for Gemma, Mistral, and Grok as well. These three models all seem to handle number tokenization cleanly, one digit at a time. This suggests there isn’t some well-considered, valid reason that digits need uneven, lumpy tokenization, and that it’s really just a mistake.
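If you want to reproduce this kind of check locally rather than relying on a predicted tokenization, here is a minimal sketch using the Hugging Face transformers library. The model IDs are illustrative assumptions (the Gemma and Mistral checkpoints are gated on the Hub, and Grok's tokenizer isn't included), so swap in whatever tokenizers you actually have access to:

    # Minimal sketch: compare how a few tokenizers split a long digit string.
    # Assumes `pip install transformers`; the Gemma/Mistral checkpoints are
    # gated on the Hugging Face Hub, so you may need to accept their licenses
    # and log in first. Model IDs here are illustrative.
    from transformers import AutoTokenizer

    MODELS = [
        "gpt2",                        # older BPE vocab with multi-digit merges
        "google/gemma-2b",             # reportedly splits numbers digit by digit
        "mistralai/Mistral-7B-v0.1",   # likewise single-digit tokens
    ]

    number = "1234567890123456789"

    for model_id in MODELS:
        tok = AutoTokenizer.from_pretrained(model_id)
        print(f"{model_id}: {tok.tokenize(number)}")

The output makes the difference obvious at a glance: vocabularies with multi-digit merges break the string into irregular chunks, while single-digit tokenizers return one token per digit.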