The likelihood loss intersection point is very vague, as they point out: for that specific architecture/training method/dataset, it only weakly suggests a crossover to a slower-scaling curve requiring increasingly more data, somewhere between 10^4 and 10^6 or so. As GPT-3 hits 10^3 and is still dead on the scaling curve, it seems any crossover will happen toward the higher end rather than the lower. (I suspect part of what’s going on there is the doubled context window: as Nostalgebraist notes, their experiments with 1024 ctx strongly suggest that the more context window you have, the more you can learn profitably, so doubling to 2048 ctx probably pushed off the crossover quite a bit. Obviously, they have a long way to go there.) So the crossover itself, much less negative profitability of scaling, may be outside the 100-1000x scale-ups currently being mooted. (I’d also note that I don’t see why they are so baffled at the suggestion that a model could overfit in a single epoch. Have they looked at the Internet lately? It is not remotely a clean, stationary, minimal, or i.i.d. dataset, even after cleaning & deduplication.)
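For intuition on why the predicted crossover is so vague, here is a minimal sketch of where such an intersection point comes from: the compute at which two power-law loss curves with different exponents cross. All constants and exponents below are illustrative placeholders, not the fitted values from Kaplan et al. or the GPT-3 paper; the point is only that small changes in the fitted exponents move the intersection by orders of magnitude.

```python
import math

# Two hypothetical power-law loss curves in compute C (arbitrary units):
#   fast regime: L_fast(C) = a_fast * C**(-alpha_fast)
#   slow regime: L_slow(C) = a_slow * C**(-alpha_slow), with alpha_slow < alpha_fast
# The four constants below are illustrative placeholders, NOT fitted values
# from any scaling-law paper.
a_fast, alpha_fast = 2.0, 0.095
a_slow, alpha_slow = 1.0, 0.050

# Setting L_fast(C) = L_slow(C) and solving for C gives the crossover point:
#   C* = (a_fast / a_slow) ** (1 / (alpha_fast - alpha_slow))
crossover = (a_fast / a_slow) ** (1.0 / (alpha_fast - alpha_slow))
print(f"crossover compute ~ 10^{math.log10(crossover):.1f} (arbitrary units)")
```

Because the exponent difference sits in the denominator of that (large) power, a slight shift in either fitted slope swings the estimated crossover across a couple of orders of magnitude, which is why "anywhere between 10^4 and 10^6" is about as precise as the extrapolation gets.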
I also think that, given everything we’ve learned about prompt programming and the large increases on benchmarks like arithmetic or WiC, making arguments from pseudo-lack-of-scaling on the paper’s benchmarks is somewhere between foolish and misleading, at least until we have an equivalent set of finetuning benchmarks, which should cut through the problem of finding good prompts (however bad the default prompt is and however much it biases performance downwards, some finetuning should quickly fix that, regardless of meta-learning) and show what GPT-3 can really do.