Slightly off on a tangent, but I am confused about the reasons and assumptions that underpin the current tokenizer used for GPT-3.
I get that reality has more words than could be packed into 50400 tokens (and that limit comes from hardware).
I also get that the token space needs to be big, so you can’t just go to character-level tokenization; you would end up with a space that is too small.
But why on earth did the tokens end up like this? A lot of them look like garbage, and a lot of them look like repeats of the same word but with added whitespace or unprintable characters.
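To make that concrete, here is roughly the kind of thing I am looking at: a quick peek at the GPT-2/GPT-3 vocabulary using OpenAI’s tiktoken package (just a sketch, assuming that is the right tool for inspecting it):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")  # the BPE vocabulary used by GPT-2/GPT-3

# The same word with a leading space, a capital, or extra spaces maps to
# different token IDs, which is why the vocabulary looks full of near-duplicates.
for s in ["hello", " hello", " Hello", "  hello"]:
    print(repr(s), "->", enc.encode(s))
```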
Surely there is some middle ground that better matches the reality of how we (humans) use words. And I think the confusing part for me is here: why wouldn’t we construct a map that looks a lot like the features we see in the territory? (Really a map builder and not a map.)
Confused I am, knowledge I seek.
You’re overthinking it. OA BPEs are merely a quick hack by Radford or someone back in 2017 or so. They spent 5 seconds thinking about it: “I need to tokenize somehow; my Transformer has a context window of like 512, so character-based is out; word-level is too inflexible for the multilingual multitask training I hope the model will learn to do, and would lead to lots of UNKs; the only thing off-the-shelf in a convenient library on Github is BPEs; there, trained it on my dump of uncleaned Internet garbage data, done! Now on to something that actually matters...” No one was sitting down and debating the merits of BPEs vis-à-vis Unigram-LM or BPE-dropout or whether it’d sabotage poetry. (BPEs are not optimal in any sense for anything. They don’t even guarantee optimality for what they do, as they just create tokens greedily, IIRC. And there are many viable choices: you can have a few thousand BPEs, or push it to several hundred thousand like Jurassic, or to a million like FB recently did. You can use wordpieces or character encoding (ByT5), or you can expand the vocab & redo it on specialized corpuses like OA did with Github for Codex, or their new c100k tokenization, etc.) It was just one of innumerable minor engineering decisions made along the way. It was never supposed to be important or to still matter 6+ years later; it only does because the models turned out to be so important & worth keeping backwards-compatibility for. And they definitely weren’t thinking about anything remotely like unspeakability.
Are there any serious efforts to fix these quick hacks?
Sure. I listed a bunch of improvements right there. They just always tend to be relatively small on economically-important use cases, compared to other things like RLHF. (Sadly, no matter how much I whine about ChatGPT’s poetry being super basic and bland because of BPEs + mode-collapse, better poetry wouldn’t make OA much money compared to working harder on RL tuning to prioritize Q&A or reasoning, or investing in more GPUs to serve more users.)
Ah that makes sense, so the fixes are ready but waiting for a compelling argument?
I think so. If someone could show that BPEs were changing the scaling laws on an important task end-users will pay for, then it wouldn’t be hard to change that: for example, I noted that Codex induced OA to change BPEs, because generating BPEs optimized for programming-language syntax substantially increased the effective context window, which matters to big paying customers like Github (the larger the ctx, the more of the variables & definitions inside a specific project are available for relevant completion). Otherwise, the general attitude seems to be to shrug and assume it’ll fix itself at some point, when GPT-4 or GPT-5 or god knows what system uses some new tokenization or byte-level encoding or a fancy new attention/history mechanism with near-unlimited ctx, motivated by other concerns, and then the problems go away and become a minor historical footnote. And they are probably right to, as much as it annoys me to see the bad poetry, or to see people running into blatantly BPE-caused problems and declaring ‘deep learning has hit a wall!’...
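If someone wanted to check that effective-context-window point for themselves, a rough sketch with OpenAI’s tiktoken package is to count tokens for the same chunk of code under the GPT-3, Codex, and newer vocabularies (the encoding names are real; the toy snippet is just my own illustration). Fewer tokens for the same text means more of a project fits into a fixed ctx.

```python
import tiktoken  # pip install tiktoken

# A toy, indentation-heavy chunk of Python, repeated so the differences are visible.
snippet = (
    "class Point:\n"
    "    def __init__(self, x, y):\n"
    "        self.x = x\n"
    "        self.y = y\n\n"
) * 50

# r50k_base ~ GPT-3, p50k_base ~ Codex (which added whitespace-run tokens),
# cl100k_base ~ the newer 100k vocabulary.
for name in ["r50k_base", "p50k_base", "cl100k_base"]:
    n_tokens = len(tiktoken.get_encoding(name).encode(snippet))
    print(f"{name:12s} {n_tokens:5d} tokens")
```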
Thanks for the insight.
BPEs are one of the simplest schemes for producing a large, roughly frequency-weighted set of tokens that compresses arbitrary bytes drawn from a written-language training dataset. That’s about all you need to explain things in ML, typically.
Subword tokenization, the linguistically-guided pre-LLM approach, has a history but is comparatively complex, and I don’t think it compresses as well for a given token budget even on fairly normal-looking text.
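If it helps, here is a minimal character-level sketch of that training loop (real tokenizers work on raw bytes and handle pre-splitting and special tokens more carefully, so this is purely illustrative):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair of symbols."""
    # Start from characters plus an end-of-word marker; real tokenizers start from bytes.
    words = [list(w) + ["</w>"] for w in text.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # purely greedy: no global optimality guarantee
        merges.append(best)
        merged = best[0] + best[1]
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

print(train_bpe("low lower lowest new newer newest", 8))
```

Each merge adds one new token to the vocabulary, so the final token set is simply whatever happened to be frequent in the training dump, whitespace-prefixed near-duplicates and all.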
Thanks, turns out I was not as confused as I thought; I just needed to see the BPE algorithm.