There is likely a reason for this: if you feed numbers you found on the internet into an LLM digit by digit, it's going to destroy the embeddings of those numbers. A lot of things found in scrapes are just… extremely long sequences of numbers. The tradeoff may be numeracy (can do basic multiplication) vs. natural language performance (won't start spitting out Minecraft debug logs in the middle of a conversation).
Right, but… it’s a combined issue of tokenization and data cleaning. I agree you can’t properly do one without the other. I just think you should do both!
Clearly the root cause is the messy data, and once you’ve cleaned the data it should be trivial to fix the tokenizer.
I checked Lunary’s predicted tokenization for Gemma, Mistral, and Grok as well. These three models all seem to handle number tokenization cleanly, one digit at a time. This suggests there isn’t some well-considered, valid reason that digits need uneven, lumpy tokenization, and that it’s really just a mistake.
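If you want to reproduce this kind of check locally rather than relying on a predicted tokenization, here is a minimal sketch using the Hugging Face transformers library. The model IDs are illustrative assumptions (the Gemma and Mistral checkpoints are gated on the Hub, and Grok's tokenizer isn't included), so swap in whatever tokenizers you actually have access to:

    # Minimal sketch: compare how a few tokenizers split a long digit string.
    # Assumes `pip install transformers`; the Gemma/Mistral checkpoints are
    # gated on the Hugging Face Hub, so you may need to accept their licenses
    # and log in first. Model IDs here are illustrative.
    from transformers import AutoTokenizer

    MODELS = [
        "gpt2",                        # older BPE vocab with multi-digit merges
        "google/gemma-2b",             # reportedly splits numbers digit by digit
        "mistralai/Mistral-7B-v0.1",   # likewise single-digit tokens
    ]

    number = "1234567890123456789"

    for model_id in MODELS:
        tok = AutoTokenizer.from_pretrained(model_id)
        print(f"{model_id}: {tok.tokenize(number)}")

The output makes the difference obvious at a glance: vocabularies with multi-digit merges break the string into irregular chunks, while single-digit tokenizers return one token per digit.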