It’s the cross-entropy that is left after you scale to infinity, and it is measured per symbol, yes. It is measured using BPEs, and the unit is nats/token. It might be equal to the true entropy, but this is conjecture, as the model might never learn some aspects of language at any size within the regimes we can model.
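To make "the cross-entropy left after you scale to infinity" concrete: the Chinchilla paper fits a parametric loss whose constant term is exactly that irreducible piece. A rough sketch in Python, using the fitted constants reported by Hoffmann et al. 2022 (treat them as approximate):

import math

# Chinchilla parametric fit: L(N, D) = E + A / N**alpha + B / D**beta
# E is the irreducible loss in nats/token; the other terms vanish as N, D -> infinity.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

print(chinchilla_loss(70e9, 1.4e12))   # predicted loss at roughly Chinchilla's own scale, nats/token
print(E / math.log(2))                 # the scale-to-infinity limit in bits/token (~2.44)

That ~2.44 bits/token figure is presumably where the ~2.43 number later in the thread comes from.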
For a large enough dataset, if you change only the model and not the BPEs or the data distribution, the loss should be a constant multiple of bits/character, bits/byte, or bits/word. Chinchilla gets 0.667 bits/byte on pile_cc and a loss of 1.97 on WikiText-103 (1.97/0.667 ≈ 3), which is unhelpfully not at all controlled, but should suffice for ballpark conversions.
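The conversion itself is just unit bookkeeping, with bytes-per-token as the uncontrolled factor. A small sketch, where bytes_per_token is whatever your tokenizer actually achieves on your data; the ~4 used below is only a ballpark assumption for GPT-style BPE on English:

import math

def bits_per_byte(loss_nats_per_token, bytes_per_token):
    # nats/token -> bits/token -> bits/byte
    return loss_nats_per_token / math.log(2) / bytes_per_token

# 1.97 nats/token at ~4 bytes/token lands in the same ballpark as the
# 0.667 bits/byte figure above; the datasets (WikiText-103 vs pile_cc)
# differ, so treat this as a sanity check only.
print(bits_per_byte(1.97, 4.0))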
That’s actually precisely what I’m interested in finding out: how closely this scaling would match the ‘expected’ entropy of English in the infinite limit. (Of course, this assumes that said approximation actually holds in the limit.)
Hm. Any idea what the compression ratio of BPE on English text is? A quick look shows a ~51% compression ratio[1] for BPE on the Brown corpus, which I suppose I could use as a starting point.
So if I’m understanding correctly (one nat ≈ 1.44 bits of entropy), that’s ~2.43 bits/token? Assuming a BPE compression ratio of 51.08% on English text (each token encoding 4.0864 bits, given 51.08% compression of what I assume to be 8-bit ASCII), that means ~0.595 bits/character.
...which actually matches Shannon’s estimate of the entropy of English surprisingly well (0.6–1.3 bits/character).
[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.46.4046&rep=rep1&type=pdf
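Spelling out the arithmetic above, a minimal sketch; the characters-per-token value is the assumption doing the work (the 4.0864 figure, read as average characters covered per token, which is in the right range for GPT-style BPE on English), and the 1.69 nats/token is the Chinchilla asymptote discussed earlier:

import math

loss_nats_per_token = 1.69                          # asymptotic cross-entropy, nats/token
bits_per_token = loss_nats_per_token / math.log(2)  # ~2.44 bits/token
chars_per_token = 8 * 0.5108                        # ~4.09; 51.08% compression of 8-bit ASCII,
                                                    # treated as characters per token (assumption)
bits_per_char = bits_per_token / chars_per_token
print(bits_per_token, bits_per_char)                # ~2.44 bits/token, ~0.60 bits/character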
This is the vocab file GPT uses. Don’t stare too long; I have heard the jank is too great for human conception. I might already be infected. Most models don’t bother changing the BPEs, but those that do probably don’t have it any better. (This is machine learning, where your inputs can be almost infinitely awful and nothing will stop working as long as your models are large enough.)
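If you want to see the jank without reading the raw vocab file, tokenizing a string with the GPT-2 encoder shows the byte-level pieces (spaces show up as Ġ, non-ASCII gets shredded into byte soup). A quick look via the Hugging Face tokenizer, assuming you have transformers installed:

from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
text = "The true entropy of English is supposedly around one bit per character, naïvely speaking."
pieces = tok.tokenize(text)
print(pieces)                    # byte-level BPE pieces, Ġ marking spaces
print(len(text) / len(pieces))   # rough characters-per-token ratio, typically ~4 for plain English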
The true entropy of text is not especially well defined, and it’s hard to tell whether something the model can’t learn regardless of scale is a true feature of the distribution or just intractable. I would say that models do seem to be capturing the shape of what looks, to my mind, like the true distribution, and if they do fall short in the limit, it shouldn’t be by very much.
I noted that Chinchilla gets 0.667 bits/byte on pile_cc, which is basically the same as bits per character for random internet text. The difference is that pile_cc isn’t pure ASCII, but ASCII makes up a large enough fraction of it that I wouldn’t worry about the details.
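The bits/byte vs. bits/character point is easy to sanity-check: for mostly-ASCII web text the UTF-8 byte count and the character count are nearly equal, so the two units barely differ. A trivial check on whatever sample you like:

sample = "Mostly ASCII internet text, with the occasional café or emoji 🙂 thrown in."
print(len(sample.encode("utf-8")) / len(sample))  # UTF-8 bytes per character; close to 1.0 for text like this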