You’re really not going to like the fact that the GPT4o tokenizer has a dedicated token for every single number below 1000. It’s not a hand-crafted feature, since the token IDs are scattered all over the place. I think they had to manually remove larger number tokens (there are none above 999).
I feel like I need a disclaimer like the South Park episode. This is what is actually inside the tokenizer.
['37779', '740'],
['47572', '741'],
['48725', '742'],
['49191', '743'],
['46240', '744'],
['44839', '745'],
['47433', '746'],
['42870', '747'],
['39478', '748'],
['44712', '749'],
They also have plenty of full-width numbers (generally used only in Chinese and Japanese text, so as not to mess with spacing) and numbers in other languages in there.
['14334', '十'],
['96681', '十一'],
['118633', '十三'],
['138884', '十九'],
['95270', '十二'],
['119007', '十五'],
['107205', '十八'],
['180481', '十四'],
['42624', '零'],
['14053', '0'],
['49300', '00'],
['10888', '1'],
['64980', '10'],
['141681', '100'],
['113512', '11'],
['101137', '12'],
['123326', '13'],
['172589', '14'],
['126115', '15'],
['171221', '16']
Maybe they use a different tokenizer for math problems? Maybe the multi-digit number tokens are only used in places with a lot of ID numbers? Nope. Looks like they were just raw-dogging it. If anyone is wondering why GPTs are so bad at basic multiplication, this is why.
Colin Fraser on X: “Here’s a similar experiment I just tried. The fact that this works even a little bit completely blows my mind and confuses me greatly. If you asked me if this would work at all I would say definitely not. https://t.co/E4knpf7JoZ”
If you’ve ever wondered “wow, why is GPT4o specifically better at math when the number of digits is divisible by 3?”, wonder no more. It’s the tokenizer. Again.
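The "divisible by 3" quirk falls right out of the pre-tokenization step: o200k-style tokenizers split runs of digits into chunks of at most three before BPE ever runs (the actual pattern uses something like `\p{N}{1,3}`). A minimal sketch of the effect, using Python's `re` as a stand-in for the real pre-tokenizer (the left-to-right grouping is an assumption on my part):

```python
import re

# Approximation of the o200k-style rule: digit runs are pre-split
# into chunks of 1-3 digits, scanning left to right.
def chunk_digits(number: str) -> list[str]:
    return re.findall(r"\d{1,3}", number)

# A 6-digit number splits cleanly into two 3-digit tokens...
print(chunk_digits("123456"))   # ['123', '456']
# ...but a 7-digit number leaves a ragged one-digit tail.
print(chunk_digits("1234567"))  # ['123', '456', '7']
```

So a number whose digit count is divisible by 3 splits into uniform 3-digit tokens, while any other length produces an uneven mix of 1-, 2-, and 3-digit pieces — plausibly an easier representation for the model in the uniform case.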
Ironically, this is the opposite of the well-known calculating prodigy Shakuntala Devi, who said that inserting commas (e.g. “5,271,009”) interfered with her perception of the number.
Interesting that GPT4o is so bad at math and tokenizes large numbers like this. I wonder if adding commas would improve performance?
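If the pre-tokenizer really does split digit runs into chunks of at most three, commas would realign those chunks with how humans group numbers, since the comma is a non-digit that resets the count. A hypothetical sketch (the `\d{1,3}` pattern is my approximation, not the real tokenizer):

```python
import re

# Stand-in for an o200k-style pre-tokenizer: digit runs become
# chunks of at most three digits; everything else passes through.
def pre_split(text: str) -> list[str]:
    return re.findall(r"\d{1,3}|[^\d]+", text)

print(pre_split("5271009"))    # ['527', '100', '9'] - ragged grouping
print(pre_split("5,271,009"))  # ['5', ',', '271', ',', '009'] - aligned
```

With commas, every digit group matches the human reading ("5 million, 271 thousand, 9") at the cost of extra separator tokens — which is at least a plausible mechanism for commas helping.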
🤢 But whyyyyyyyy?!