Seems to indicate that GPT-2 uses a byte-level BPE, though maybe the impl here is wrong; I'd have expected it to use something closer to a word-by-word tokenizer with exceptions for rare words (i.e. a sub-word tokenizer that's basically acting as a word tokenizer 90% of the time). And maybe GPT-3 uses the same?
Also, it seems that sub-word tokenizers split much more aggressively than I'd have assumed.
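To sanity-check that, here's a minimal sketch using the HuggingFace transformers GPT-2 tokenizer (assuming the package is installed and the pretrained `gpt2` vocab can be downloaded); common words mostly come out as single tokens, while rarer words get chopped into several byte-level BPE pieces:

```python
# Minimal sketch: inspect GPT-2's byte-level BPE splits with HuggingFace transformers.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "Tokenization of uncommon words like electroencephalography"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

# Common words usually stay whole (with a leading 'Ġ' marking the preceding space),
# while rarer words are split into several sub-word pieces.
print(tokens)
print(ids)
print(tokenizer.decode(ids))  # byte-level BPE round-trips back to the original text
```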
Hmh, ok, quick update to my knowledge that I should have done before: https://huggingface.co/transformers/tokenizer_summary.html
Complaint retracted.