The number of problems that non-character/byte tokenization causes, whether BPE or WordPiece, never fails to amaze me. What a kettle of worms is that attractive-looking hack to save context window & speed up learning—especially as the models become so smart they otherwise make few errors & it becomes harder to shrug away tokenization pathologies.
Would be funny if the hurdle presented by tokenization were somehow responsible for LLMs being smarter than expected :) Sounds exactly like the kind of curveball reality likes to throw at us from time to time :)
I definitely think that LLMs are ‘smarter than expected’ for many people due to tokenization, if only because they look at tokenization errors, which are so vivid and clear, and then ignore things like GPQA which are arcane and hard to read, and conclude LLMs are stupid. “It can’t even count the letters in ‘strawberry’, obviously this is all bunk.”
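For anyone who hasn't looked at why the 'strawberry' failure happens: here's a minimal sketch (assuming the `tiktoken` library is installed; the exact token split is illustrative, not guaranteed) of how a BPE tokenizer hides the individual characters from the model.

```python
# Minimal sketch: why letter-counting is hard for a BPE-based model.
# The model never "sees" the characters of "strawberry", only opaque
# multi-character tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era BPE tokenizer

word = "strawberry"
token_ids = enc.encode(word)
pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]

print(token_ids)  # a handful of integer ids, not ten separate characters
print(pieces)     # multi-character chunks, e.g. something like ['str', 'aw', 'berry']
print(len(word), "characters vs", len(token_ids), "tokens")
```

So counting the r's requires the model to memorize or infer the spelling hidden inside each token, rather than just scanning characters, which is exactly the kind of pathology a character- or byte-level model wouldn't have.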