Looking into the details, BPEs seem to usually fall back to treating unknown characters as literal bytes: there are another 256 BPEs which cover the 256 possible bytes, and any UTF-8 character is 1-4 bytes, so it can always be represented by 1-4 BPEs. The 1-byte UTF-8 characters are the ASCII characters, which have their own BPEs, so this fallback would only matter for 2-4-byte-long UTF-8 characters like Cyrillic or Chinese.
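To make the byte-fallback point concrete, here's a quick Python snippet (just an illustration of UTF-8 byte lengths, nothing to do with OA's actual tokenizer code):

```python
# Each Unicode character is 1-4 UTF-8 bytes; a byte-level BPE can always
# fall back to its 256 base byte tokens, so the worst case is 1 BPE per byte.
for ch in ["a", "ж", "中", "𝕏"]:
    b = ch.encode("utf-8")
    print(ch, len(b), list(b))  # 1, 2, 3, and 4 bytes respectively
```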
So actually, now that I think about it, it’s possible that Russian gets encoded to worse than 1 BPE per character; it could be 2 BPEs per character, since Cyrillic falls in the 2-byte range of UTF-8. It’d depend on the details. (On the other hand, having to pay 2-4 BPEs per Unicode character is obviously not as big a deal for Japanese & Chinese characters, since each of those carries far more information than a single Cyrillic letter...)
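If you want to check actual token counts rather than guess, the tiktoken package ships a "gpt2" encoding which I believe is the same GPT-2/GPT-3 BPE vocabulary (that's an assumption on my part, so treat this as a sanity check rather than gospel):

```python
# Sanity check with tiktoken's "gpt2" encoding (assumed to match the
# GPT-2/GPT-3 BPE vocabulary).
import tiktoken

enc = tiktoken.get_encoding("gpt2")
for text in ["hello world", "привет мир", "你好世界"]:
    tokens = enc.encode(text)
    print(f"{text!r}: {len(text)} chars -> {len(tokens)} BPE tokens")
# Expect noticeably more tokens per character for the Russian and Chinese
# strings than for the English one.
```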
I wouldn’t expect the BPE to allocate much space to Cyrillic just because it’s the 5th most common script in the dataset; that’s just another way of saying that all the Russian put together is all of 0.18% of the dataset. And keep in mind that the BPE encoding was not, AFAIK, redone for GPT-3: it’s the same BPE OA has been using ever since GPT-2 way back when, and so was optimized for their Reddit-sourced, English-heavy original WebText.
Wow, I didn’t realise I could get this angry about something so esoteric.