GPT-3 (and most pretrained transformers) generate tokens, not words or characters. Sometimes, those tokens represent words and sometimes they represent single characters. More common words receive their own token, and less common words are broken into two or more tokens. The vocab is tuned to minimize avg. text length.
Hmh, ok, quick update to my knowledge that I should have done before: https://huggingface.co/transformers/tokenizer_summary.html
Seems to indicate that GPT-2 uses a byte-level BPE, though maybe the impl here is wrong; I’d have expected it to use something closer to a word-by-word tokenizer with exceptions for rare words (i.e. a sub-word tokenizer that’s basically acting as a word tokenizer 90% of the time). And maybe GPT-3 uses the same?
Also, it seems that sub-word tokenizers split much more aggressively than I’d have assumed.
Complaint retracted.
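For the record, here’s a minimal sketch of what that splitting looks like with the GPT-2 tokenizer from the Hugging Face transformers library (the example words are my own picks, and the exact splits depend on the learned vocab):

```python
# Minimal sketch, assuming the Hugging Face `transformers` package is installed.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Common words tend to come back as a single token; rarer or made-up
# words get split into several sub-word pieces.
for word in ["the", "tokenizer", "pretrained", "antidisestablishmentarianism"]:
    # The leading space matters: GPT-2's byte-level BPE folds it into the
    # following token (it shows up as "Ġ" in the printed token strings).
    tokens = tokenizer.tokenize(" " + word)
    print(f"{word!r} -> {tokens}")
```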