It mostly only means that training them compute-optimally will require much more data, and it doesn’t rule out OpenAI-style mostly-parameter scaling at all. Data scaling can be necessary to minimise loss, to get optimal estimates of certain entropic variables, while still being unnecessary for general intelligence. Large undertrained models still learn faster. This new paper mostly shows that parameter and data scaling are both significantly more efficient than the older fit suggested, data scaling more so, such that it’s now optimal to trade the two loss terms off about 1:1.
Below the fold are some musings and analysis around this question, though not a direct answer to it.
We can take a look at the loss function, defined in terms of the irreducible loss (aka the unmodelable entropy of language), the number of parameters N, and the number of data tokens D.
$$L(N,D) = 1.69 + \frac{406.4}{N^{0.34}} + \frac{410.7}{D^{0.28}}$$
If we put in the parameters for Chinchilla (N ≈ 70B parameters, D ≈ 1.4T tokens), we see $\frac{406.4}{N^{0.34}} \approx 0.083$ and $\frac{410.7}{D^{0.28}} \approx 0.163$. Although these equations have been locally tuned and are not valid in the infinite limit of a single variable, this roughly says that just scaling parameter counts without training for longer will only tackle about a third of the remaining reducible loss.
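For anyone who wants to check the arithmetic, here it is as a quick Python sketch; the 70B-parameter and 1.4T-token figures are the Chinchilla training run being assumed:

```python
# Chinchilla scaling law from above: L(N, D) = 1.69 + 406.4 / N**0.34 + 410.7 / D**0.28
# Plugging in roughly Chinchilla's own run: ~70B parameters, ~1.4T training tokens.

N = 70e9    # parameters
D = 1.4e12  # training tokens

param_term = 406.4 / N**0.34  # reducible loss attributed to finite model size
data_term = 410.7 / D**0.28   # reducible loss attributed to finite data

print(param_term)                             # ≈ 0.083
print(data_term)                              # ≈ 0.163
print(param_term / (param_term + data_term))  # ≈ 0.34, i.e. about a third
```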
Note the implicit assumption that we are working in the infinite-data limit, where we never intentionally train on the same tokens twice. Running out of data doesn’t mean you are no longer able to train your models for longer as you scale; it only means that you will have to make more use of the data you already have, which can mean as little as multiple epochs or as much as sophisticated bootstrapping methods.
The original scaling laws did not decompose so easily. I present them in simplified form.
$$L(N,D) = \left(\frac{1.54 \times 10^{10}}{N^{0.738}} + \frac{1.8 \times 10^{13}}{D}\right)^{0.103}$$
(Note that the dataset was different, so the exact losses won’t be centered identically.)
This has major issues, like there being no irreducible loss term and the values not being disentangled. We can still put in the parameters for GPT-3 (N ≈ 175B parameters, D ≈ 300B tokens): $\frac{1.54 \times 10^{10}}{N^{0.738}} \approx 77.7$ and $\frac{1.8 \times 10^{13}}{D} \approx 60$; or in the limits, $\left(\frac{1.54 \times 10^{10}}{N^{0.738}}\right)^{0.103} \approx 1.57$ and $\left(\frac{1.8 \times 10^{13}}{D}\right)^{0.103} \approx 1.52$. It isn’t clear what this means about the necessary amount of data scaling, as in what fraction of the loss it captures, especially because there is no entropy term, but it does mean that the two terms still contribute roughly 1:1 at the efficient point, at least if you ignore the fact that the equation is wrong. That you have to scale both in tandem to make maximal progress remains true in this older equation; it’s just more convoluted and has different factors.
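And the same plug-in exercise for the older law, as a rough sketch assuming GPT-3’s roughly 175B parameters and 300B training tokens:

```python
# Older law in the simplified form above:
# L(N, D) = (1.54e10 / N**0.738 + 1.8e13 / D) ** 0.103

N = 175e9  # GPT-3 parameters
D = 300e9  # GPT-3 training tokens

param_term = 1.54e10 / N**0.738  # ≈ 77.7
data_term = 1.8e13 / D           # = 60.0

# Each term's contribution in the limit where the other goes to infinity:
print(param_term ** 0.103)  # ≈ 1.57
print(data_term ** 0.103)   # ≈ 1.52
```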
Interesting!
What does the irreducible loss of 1.69 actually mean? I assume it’s something like entropy/symbol? What does that convert to in terms of entropy/word? Does that agree with the ‘standard’ approximations of the entropy of English text?
It’s the cross-entropy that is left after you scale to infinity, and it is measured per symbol, yes. It is measured using BPEs, and the unit is nats/token. It might be equal to the true entropy, but this is conjecture, as the model might never learn some aspects of language at any size within the regimes we can model.
For a large enough dataset, and given that you are changing only the model and not the BPEs or the data distribution, the loss should be a constant-factor multiple of bits/character, bits/byte, or bits/word. Chinchilla gets 0.667 bits/byte on pile_cc and a loss of 1.97 on WikiText-103 (1.97/0.667 ≈ 3), which is unhelpfully not at all controlled but should suffice for ballpark conversions.
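To make the ‘constant factor’ concrete, here is a sketch of the conversion; the bytes-per-token figure is an assumed ballpark, not a measured property of any particular tokenizer or dataset, which is exactly why the comparison above isn’t controlled:

```python
import math

def nats_per_token_to_bits_per_byte(loss_nats: float, bytes_per_token: float) -> float:
    """Convert a cross-entropy in nats/token into bits/byte.

    bytes_per_token depends on the tokenizer and the data distribution;
    ~4 is a rough ballpark for GPT-2-style BPE on English text (an assumption).
    """
    bits_per_token = loss_nats * math.log2(math.e)  # 1 nat ≈ 1.443 bits
    return bits_per_token / bytes_per_token

# e.g. the 1.97 nats/token WikiText-103 loss at an assumed 4 bytes/token:
print(nats_per_token_to_bits_per_byte(1.97, 4.0))  # ≈ 0.71 bits/byte
```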
That’s actually precisely what I’m interested in finding out: how closely this scaling would match the ‘expected’ entropy of English in the infinite limit. (Of course, this assumes that said approximation actually matches in the limit.)
Hm. Any idea what the compression ratio of BPE on English text is? A quick look shows a ~51% compression ratio[1] for BPE on the Brown corpus, which I suppose I could use as a starting point.
So if I’m understanding correctly (one nat ≈ 1.44 bits of entropy), ~2.43 bits / token? Assuming a BPE compression ratio of 51.08% on English text (each token encoding 8 × 0.5108 ≈ 4.0864 bits, given 51.08% compression on what I assume to be 8-bit ASCII), dividing 2.43 by 4.0864 gives ~0.595 bits / character.
...which actually matches Shannon’s estimation of the entropy of English surprisingly well (0.6-1.3 bits / character).
[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.46.4046&rep=rep1&type=pdf
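Spelled out, my chain above is just the following; the 51.08% figure comes from the linked Brown-corpus result, and using 8 × 0.5108 ≈ 4.09 as the per-character divisor is my rough approximation, not a measured property of Chinchilla’s tokenizer:

```python
import math

irreducible_nats_per_token = 1.69                                # Chinchilla's fitted irreducible loss
bits_per_token = irreducible_nats_per_token * math.log2(math.e)  # ≈ 2.44 bits/token

# Divisor from the assumed 51.08% BPE compression of 8-bit ASCII text.
per_char_divisor = 8 * 0.5108                                    # ≈ 4.09

print(bits_per_token / per_char_divisor)  # ≈ 0.60 bits/character, at the low end of
                                          # Shannon's 0.6-1.3 bits/character range
```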
This is the vocab file GPT uses. Don’t stare too long; I have heard the jank is too great for human conception. I might already be infected. Most models don’t bother changing the BPEs, but those that do probably don’t have it any better. (This is machine learning, where your inputs can be almost infinitely awful and nothing will stop working as long as your models are large enough.)
The true entropy of text is not that well defined, and it’s hard to tell whether something the model can’t learn regardless of scale is a true feature of the distribution or just intractable. I would say that models do seem to be capturing the shape of what looks to my mind like the true distribution, and if they do fall short in the limit, it shouldn’t be by very much.
I noted that Chinchilla gets 0.667 bits/byte on pile_cc, which is basically the same as bits per character on random internet text. The difference is that pile_cc isn’t pure ASCII, but ASCII makes up a large enough fraction of it that I wouldn’t worry about the details.