You’re really not going to like the fact that the GPT4o tokenizer has a dedicated token for every single number below 1000. It’s not a hand-crafted feature, since the token IDs are scattered all over the place. I think they had to manually remove larger number tokens (there are none above 999).
I feel like I need a disclaimer like the South Park episode. This is what is actually inside the tokenizer.
['37779', '740'],
['47572', '741'],
['48725', '742'],
['49191', '743'],
['46240', '744'],
['44839', '745'],
['47433', '746'],
['42870', '747'],
['39478', '748'],
['44712', '749'],
They also have plenty of full-width numbers (generally used only in Chinese and Japanese text, so as not to mess with spacing) and numbers in other languages in there.
['14334', '十'],
['96681', '十一'],
['118633', '十三'],
['138884', '十九'],
['95270', '十二'],
['119007', '十五'],
['107205', '十八'],
['180481', '十四'],
['42624', '零'],
['14053', '0'],
['49300', '00'],
['10888', '1'],
['64980', '10'],
['141681', '100'],
['113512', '11'],
['101137', '12'],
['123326', '13'],
['172589', '14'],
['126115', '15'],
['171221', '16']
Maybe they use a different tokenizer for math problems? Maybe the multi-digit number tokens are only used in places with a lot of ID numbers? Nope. Looks like they were just raw-dogging it. If anyone is wondering why GPTs are so bad at basic multiplication, this is why.
Colin Fraser on X: “Here’s a similar experiment I just tried. The fact that this works even a little bit completely blows my mind and confuses me greatly. If you asked me if this would work at all I would say definitely not. https://t.co/E4knpf7JoZ”
If you’ve ever wondered “wow, why is GPT4o specifically better at math when the number of digits is divisible by 3?”, wonder no more. It’s the tokenizer. Again.
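The "divisible by 3" quirk falls right out of the pre-tokenization step: o200k-style tokenizers split runs of digits into chunks of at most three before BPE ever runs (the actual pattern uses something like `\p{N}{1,3}`). A minimal sketch of the effect, using Python's `re` as a stand-in for the real pre-tokenizer (the left-to-right grouping is an assumption on my part):

```python
import re

# Approximation of the o200k-style rule: digit runs are pre-split
# into chunks of 1-3 digits, scanning left to right.
def chunk_digits(number: str) -> list[str]:
    return re.findall(r"\d{1,3}", number)

# A 6-digit number splits cleanly into two 3-digit tokens...
print(chunk_digits("123456"))   # ['123', '456']
# ...but a 7-digit number leaves a ragged one-digit tail.
print(chunk_digits("1234567"))  # ['123', '456', '7']
```

So a number whose digit count is divisible by 3 splits into uniform 3-digit tokens, while any other length produces an uneven mix of 1-, 2-, and 3-digit pieces — plausibly an easier representation for the model in the uniform case.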
Ironically, this is the opposite of the well-known calculating prodigy Shakuntala Devi, who said that inserting commas (e.g. “5,271,009”) interfered with her perception of the number.
Interesting that GPT4o is so bad at math and tokenizes large numbers like this. I wonder if adding commas would improve performance?
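If the pre-tokenizer really does split digit runs into chunks of at most three, commas would realign those chunks with how humans group numbers, since the comma is a non-digit that resets the count. A hypothetical sketch (the `\d{1,3}` pattern is my approximation, not the real tokenizer):

```python
import re

# Stand-in for an o200k-style pre-tokenizer: digit runs become
# chunks of at most three digits; everything else passes through.
def pre_split(text: str) -> list[str]:
    return re.findall(r"\d{1,3}|[^\d]+", text)

print(pre_split("5271009"))    # ['527', '100', '9'] - ragged grouping
print(pre_split("5,271,009"))  # ['5', ',', '271', ',', '009'] - aligned
```

With commas, every digit group matches the human reading ("5 million, 271 thousand, 9") at the cost of extra separator tokens — which is at least a plausible mechanism for commas helping.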
🤢 But whyyyyyyyy?!