NB: Loss ≠ perplexity. Perplexity is the exponential of the entropy, and you have to take a logarithm before comparing it to bits-per-thing. 1.69 is a loss, not a perplexity, so it is already in nats (which differ from bits by a constant factor). An example of perplexity is Chinchilla getting 7.16 (~e^1.97) on Wikitext103.
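To make the loss/perplexity distinction concrete, here is a small sketch of the conversion, using the Chinchilla Wikitext103 perplexity quoted above. The function names are made up for illustration and do not come from any paper or library.

```python
import math

def perplexity_to_nats(ppl: float) -> float:
    """Per-token loss in nats is the natural log of the perplexity."""
    return math.log(ppl)

def nats_to_bits(nats: float) -> float:
    """Nats and bits differ only by the constant factor ln(2)."""
    return nats / math.log(2)

loss_nats = perplexity_to_nats(7.16)   # perplexity -> loss in nats per token
loss_bits = nats_to_bits(loss_nats)    # same loss expressed in bits per token
print(f"{loss_nats:.2f} nats/token, {loss_bits:.2f} bits/token")
```

So a perplexity of 7.16 corresponds to a loss of about 1.97 nats, or about 2.84 bits, per token.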
BPEs are, I think, roughly equivalent to like 3 characters or bytes
A nat-per-BPE is about 1/3 bits-per-byte. A BPE is thus around 4.3 (log2(7.16) / 0.667 ≈ 4.26) characters. I am not 100% sure I did that right but that seems like a more sensible answer.
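The arithmetic above can be sketched as follows. This is a rough check, not anything definitive: it assumes one character is one byte, treats 0.667 as a bits-per-byte figure exactly as it is used in the estimate, and the helper name is hypothetical.

```python
import math

BITS_PER_NAT = 1 / math.log(2)  # about 1.4427 bits in one nat

def nats_per_token_to_bits_per_byte(nats_per_token: float, bytes_per_token: float) -> float:
    """Convert a per-token loss in nats to bits per byte, given an average token length."""
    return nats_per_token * BITS_PER_NAT / bytes_per_token

# Implied token length: (bits per token) / (bits per byte) = bytes per token.
bits_per_token = math.log2(7.16)
implied_bytes_per_token = bits_per_token / 0.667

# One nat per BPE token at roughly 4.3 bytes per token, in bits per byte.
one_nat_in_bpb = nats_per_token_to_bits_per_byte(1.0, 4.3)

print(f"implied bytes per token: {implied_bytes_per_token:.2f}")
print(f"1 nat/BPE in bits/byte:  {one_nat_in_bpb:.3f}")
```

With about 4.3 bytes per token, one nat-per-BPE comes out close to 1/3 bits-per-byte, which is where the rule of thumb comes from.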
It is annoying that one paper uses three different units for the same thing depending on the dataset, and the base isn’t even explicit in some of them, instead of just reporting everything in bits per byte. But what are you going to do, expect people to coordinate? Ridiculous. Much better to just confuse people all the time.
I am not 100% sure I did that right but that seems like a more sensible answer.
Eyy, I should trust myself more. Verified on Pile-CC.
(GPT-2/3 BPE)
>>> k = 100000000; k / len(tokenizer(cc[:k])["input_ids"])
4.355680325470372
(T5 sentencepiece)
>>> k = 10000000; k / len(tokenizer(cc[:k])["input_ids"])
4.182535904979476
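For anyone who wants to repeat the measurement on their own text, here is a minimal sketch of the same ratio as a reusable function. The function name and the whitespace "tokenizer" in the demo are stand-ins of mine; the numbers above came from the real GPT-2/3 and T5 tokenizers on Pile-CC, not from this toy.

```python
from typing import Callable, Sequence

def chars_per_token(text: str, tokenize: Callable[[str], Sequence]) -> float:
    """Average number of characters per token for a given tokenizer callable."""
    n_tokens = len(tokenize(text))
    if n_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    # len(text) counts Unicode characters; use len(text.encode("utf-8")) for bytes.
    return len(text) / n_tokens

# Toy demo: whitespace splitting stands in for a real BPE tokenizer.
sample = "the quick brown fox jumps over the lazy dog"
ratio = chars_per_token(sample, str.split)
print(ratio)
```

With a Hugging Face tokenizer like the one in the session above, you would pass something like `lambda s: tokenizer(s)["input_ids"]` as the `tokenize` argument.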