Wow, just wow. Sure seems like Gwern has it spot on here that OpenAI engineers are being rushed and sloppy about this.
Kinda pisses me off, considering that I applied to work for OpenAI between GPT-2 and GPT-3, and this isn’t the sort of mistake I would make. Yet, they rejected my application. (note: given the current state of OpenAI, I wouldn’t apply today!)
I was building my own small LLMs around that time for practice, and the low-frequency tokens and the tokenization of digits were among the first things I checked!
I tried a quick search to see whether Anthropic is doing the same nonsense. Found this site: https://lunary.ai/anthropic-tokenizer and this repo: https://github.com/javirandor/anthropic-tokenizer
Lunary shows Anthropic as not only tokenizing groups of digits together, but also sometimes digits with a space in between!? Is that true? I’m flabbergasted if so. I’m going to look into this more.
[Edit: some people have suggested to me that Lunary is wrong about Anthropic’s tokenization, and that Anthropic either never had this problem or has since fixed it. Hopefully that’s true. Still pretty surprising that OpenAI is still suffering from it!]
You’re really not going to like the fact that the GPT4o tokenizer has a dedicated token for every single number below 1000. It’s not a hand-crafted feature, since the token IDs are scattered all over the place. I think they had to manually remove larger number tokens (there are none above 999).
I feel like I need a disclaimer like the South Park episode. This is what is actually inside the tokenizer.
['37779', '740'],
['47572', '741'],
['48725', '742'],
['49191', '743'],
['46240', '744'],
['44839', '745'],
['47433', '746'],
['42870', '747'],
['39478', '748'],
['44712', '749'],
They also have plenty of full-width numbers (generally only used in Chinese and Japanese so they don’t mess with the spacing) and numbers in other languages in there.
['14334', '十'],
['96681', '十一'],
['118633', '十三'],
['138884', '十九'],
['95270', '十二'],
['119007', '十五'],
['107205', '十八'],
['180481', '十四']
['42624', '零'],
['14053', '0'],
['49300', '00'],
['10888', '1'],
['64980', '10'],
['141681', '100'],
['113512', '11'],
['101137', '12'],
['123326', '13'],
['172589', '14'],
['126115', '15'],
['171221', '16']
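If you want to reproduce these lists yourself rather than take my word for it, here’s a minimal sketch, assuming the tiktoken library and that o200k_base is the encoding GPT4o uses. It walks the vocabulary and collects every token that is purely ASCII digits:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumed GPT4o encoding

digit_tokens = []
for token_id in range(enc.n_vocab):
    try:
        piece = enc.decode_single_token_bytes(token_id).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        continue  # skip unused ids and byte fragments that aren't valid UTF-8 on their own
    if piece.isascii() and piece.isdigit():
        digit_tokens.append((token_id, piece))

print(len(digit_tokens))                     # how many pure-digit tokens exist
print(max(int(p) for _, p in digit_tokens))  # the largest number that has its own token

Swap isdigit() for isnumeric() and drop the isascii() check and you should also catch the full-width and Chinese-numeral tokens shown above.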
Maybe they use a different tokenizer for math problems? Maybe the multi-digit number tokens are only used in places where there are a lot of ID numbers? Nope. Looks like they were just raw-dogging it. If anyone is wondering why GPTs are so bad at basic multiplication, this is why.
Colin Fraser on X: “Here’s a similar experiment I just tried. The fact that this works even a little bit completely blows my mind and confuses me greatly. If you asked me if this would work at all I would say definitely not.” https://t.co/E4knpf7JoZ
If you’ve ever wondered “wow, why is GPT4o specifically better at math when the number of digits is divisible by 3?”, wonder no more. It’s the tokenizer. Again.
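You can see the chunking directly. A quick sketch (again assuming tiktoken and the o200k_base encoding) that prints how numbers of different lengths get split:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

for n in ["123456", "1234567", "12345678", "123456789"]:
    pieces = [enc.decode([t]) for t in enc.encode(n)]
    print(n, "->", pieces)

# Lengths divisible by 3 should come out as neat 3-digit tokens; the others
# end up with a leftover 1- or 2-digit chunk.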
Ironically, this is the opposite of the well-known calculating prodigy Shakuntala Devi, who said that inserting commas (e.g. “5,271,009”) interfered with her perception of the number.
Interesting that GPT4o is so bad at math and tokenizes large numbers like this. I wonder if adding commas would improve performance?
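Whether commas actually help the model is an empirical question, but it’s easy to at least see how they change the tokenization. A sketch, assuming tiktoken/o200k_base:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

for s in ["5271009", "5,271,009"]:
    ids = enc.encode(s)
    print(s, "->", [enc.decode([t]) for t in ids], len(ids), "tokens")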
🤢 But whyyyyyyyy?!
There is likely a reason for this: if you feed numbers scraped off the internet into an LLM digit by digit, it’s going to destroy the embeddings of those numbers. A lot of things found in scrapes are just… extremely long sequences of numbers. The tradeoff may be numeracy (can do basic multiplication) vs natural language performance (won’t start spitting out Minecraft debug logs in the middle of a conversation).
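One concrete piece of that tradeoff is sequence length. A rough sketch (tiktoken/o200k_base again, using raw character count as a crude stand-in for strict digit-by-digit tokenization):

import random
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

# Simulate a scrape fragment that is mostly long ID-like numbers.
random.seed(0)
dump = " ".join(str(random.randrange(10**12, 10**13)) for _ in range(1000))

chunked = len(enc.encode(dump))  # with multi-digit number tokens
per_char = len(dump)             # crude proxy: one token per character, digit by digit
print(chunked, per_char)

On number-heavy text like this the chunked encoding should come out noticeably shorter, which is the compression side of the tradeoff described above.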
Right, but… it’s a combined issue of tokenization and data cleaning. I agree you can’t properly do one without the other. I just think you should do both!
Clearly the root cause is the messy data, and once you’ve cleaned the data it should be trivial to fix the tokenizer.
I checked Lunary’s predicted tokenization for Gemma, Mistral, and Grok as well. These three models all seem to handle number tokenization cleanly, a single digit at a time. This suggests there isn’t some well-considered reason why digits need uneven, lumpy tokenization, and that it’s really just a mistake.
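For the open models you don’t have to rely on Lunary’s prediction at all. A sketch using the Hugging Face transformers library (assuming you can pull the tokenizer files; mistralai/Mistral-7B-v0.1 is just one example):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(tok.tokenize("The answer is 123456789."))
# If the model splits digits individually, each digit shows up as its own piece.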