Isn’t GPT-3 already almost at the theoretical limit of the scaling law from the paper? This is what nostalgebraist argues in his blog and colab notebook. You also get this result if you just compare the lambdalabs estimate of GPT-3’s training cost, 3.14E23 FLOP (i.e. 3.6k PFLOPS-days), with the ~10k PFLOPS-days limit from the paper.
(Of course, this doesn’t imply that the post is wrong. I’m sure it’s possible to train a radically larger GPT right now. It’s just that the relevant bound is the availability of data, not of compute power.)
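For anyone who wants to check the unit conversion, here is a minimal sketch. It assumes 1 PFLOPS-day means 10^15 floating-point operations per second sustained for 86,400 seconds; the two input figures are the ones quoted above, and the function name is made up for illustration.

```python
# Convert a training-compute figure in FLOP to PFLOPS-days.
# Assumption: 1 PFLOPS-day = 1e15 FLOP/s sustained for one 86,400-second day.
PFLOPS = 1e15
SECONDS_PER_DAY = 86_400


def flop_to_pflops_days(total_flop: float) -> float:
    """Return the compute budget expressed in PFLOPS-days."""
    return total_flop / (PFLOPS * SECONDS_PER_DAY)


gpt3_flop = 3.14e23   # lambdalabs estimate quoted above
paper_limit = 1e4     # the "~10k PFLOPS-days" limit quoted above

gpt3_pfdays = flop_to_pflops_days(gpt3_flop)
print(f"GPT-3 training compute: {gpt3_pfdays:,.0f} PFLOPS-days")
print(f"Fraction of the quoted limit: {gpt3_pfdays / paper_limit:.2f}")
print(f"Headroom before the limit: {paper_limit / gpt3_pfdays:.1f}x")
```

On these numbers GPT-3 sits at roughly a third of the quoted limit, i.e. less than one order of magnitude of compute away from it.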
Though it’s not mentioned in the paper, I feel like this could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further.
It’s indeed strange no-one else has picked up on this, which makes me feel I’m misunderstanding something. The breakdown suggested in the scaling law does imply that this specific architecture doesn’t have much further to go. Whether the limitation lies in something as fundamental as ‘the information content of language itself’, or in the more easily bypassed ‘information content of 1024-token strings’, is unclear.
My instinct is for the latter, though again, given that no-one else has mentioned it (not even the paper authors), I get the uncomfortable feeling I’m misunderstanding something. That said, having written that quote a few days ago without anyone pulling me up on it has increased my confidence that it’s a viable interpretation.
They do discuss this a little bit in that scaling paper, in Appendix D.6. (edit: actually Appendix D.5)
At least in their experimental setup, they find that the first 8 tokens are predicted better by a model with only 8 tokens in its window than by one with 1024 tokens, if the two have equally many parameters. They also find that later tokens are harder to predict, and hence require more parameters if you want to reach some given loss threshold.
I’ll have to think more about this and what it might mean for their other scaling laws… at the very least, it’s an effect which their analysis treats as approximately zero, and math/physics models with such approximations often break down in a subset of cases.
While you’re here and chatting about D.5 (assume you meant 5), another tiny thing that confuses me—Figure 21. Am I right in reading the bottom two lines as ‘seeing 255 tokens and predicting the 256th is exactly as difficult as seeing 1023 tokens and predicting the 1024th’?
e: Another look and I realise Fig 20 shows things much more clearly—never mind, things continue to get easier with token index.
The likelihood loss intersection point is very vague, as they point out: it only weakly suggests, for that specific architecture/training method/dataset, a crossover to a slower-scaling curve (one that requires increasing data more) somewhere between 10^4 and 10^6 or so. As GPT-3 hits 10^3 and is still dead on the scaling curve, it seems that any crossover is more likely to happen toward the higher end of that range than the lower. (I suspect part of what’s going on there is the doubled context window: as Nostalgebraist notes, their experiments with 1024 ctx strongly suggest that the more context window you have, the more you can learn profitably, so doubling to 2048 ctx probably pushed off the crossover quite a bit. Obviously, they have a long way to go there.) So the crossover itself, much less negative profitability of scaling, may be outside the current 100-1000x being mooted. (I’d also note that I don’t see why they are so baffled at the suggestion that a model could overfit in a single epoch. Have they looked at the Internet lately? It is not remotely a clean, stationary, minimal, or i.i.d. dataset, even after cleaning & deduplication.)
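To put the ranges in this comment side by side, here is a rough sketch that expresses the quoted crossover range as multiples of GPT-3's compute. It assumes the 10^4 to 10^6 range is in PFLOPS-days (the comment does not state a unit), and it tries both the rounded 10^3 baseline and the 3.6k PFLOPS-days estimate from earlier in the thread. The function and constant names are hypothetical.

```python
# Figures quoted in the thread; the unit of the crossover range is an assumption.
CROSSOVER_RANGE = (1e4, 1e6)    # assumed to be PFLOPS-days
MOOTED_SCALEUPS = (100, 1000)   # the "100-1000x" scale-ups under discussion


def crossover_multiples(baseline):
    """Return the crossover range as multiples of a baseline compute budget."""
    low, high = CROSSOVER_RANGE
    return low / baseline, high / baseline


for label, baseline in [("rounded 10^3", 1e3), ("3.6k estimate", 3.6e3)]:
    low, high = crossover_multiples(baseline)
    print(f"{label}: crossover at {low:.1f}x to {high:.0f}x GPT-3's compute")
    for scale in MOOTED_SCALEUPS:
        budget = scale * baseline
        print(f"  {scale}x scale-up -> {budget:.1e} PFLOPS-days")
```

Under these assumptions the crossover could sit anywhere from a few-fold to roughly a thousand-fold beyond GPT-3, so whether the mooted scale-ups reach it depends on where in that wide range it actually falls.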
I also think that given everything we’ve learned about prompt programming and the large increases in benchmarks like arithmetic or WiC, making arguments from pseudo-lack-of-scaling in the paper’s benchmarks is somewhere between foolish and misleading, at least until we have an equivalent set of finetuning benchmarks which should cut through the problem of good prompting (however bad the default prompt is, biasing performance downwards, some finetuning should quickly fix that regardless of meta-learning) and show what GPT-3 can really do.