Right, but… it’s a combined issue of tokenization and data cleaning. I agree you can’t properly do one without the other. I just think you should do both!
Clearly the root cause is the messy data, and once you’ve cleaned the data it should be trivial to fix the tokenizer.
I checked Lunary’s predicted tokenization for Gemma, Mistral, and Grok as well. All three seem to handle number tokenization cleanly, a single digit at a time. That suggests there isn’t some well-considered reason digits need uneven, lumpy tokenization; it’s really just a mistake.
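If anyone wants to reproduce this check locally instead of going through Lunary, here’s a minimal sketch using Hugging Face’s AutoTokenizer. The model IDs are just placeholders for whatever checkpoints you have access to, not what Lunary queries under the hood.

```python
# Quick local check of how a tokenizer splits a number string.
# Model IDs below are illustrative; swap in the checkpoints you can access.
from transformers import AutoTokenizer

def show_digit_tokens(model_id: str, text: str = "1234567890") -> None:
    tok = AutoTokenizer.from_pretrained(model_id)
    pieces = tok.tokenize(text)
    print(f"{model_id}: {pieces}")
    # Clean handling looks like one token per digit;
    # "lumpy" handling shows multi-digit chunks like '123' or '4567'.

for model_id in ["mistralai/Mistral-7B-v0.1", "google/gemma-2b"]:
    show_digit_tokens(model_id)
```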