The best argument against scaling working, from what I have seen, is the data bottleneck.
A $10B-$100B training run could maybe employ about 1e28 FLOPs with about 1M GPUs; it's not feasible to get much more on short notice. With training efficiency improvements, this might translate into 1e29 FLOPs of effective compute.
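As a rough sanity check on the 1e28 figure, here is a back-of-envelope sketch. The per-GPU throughput, utilization, and run length below are my own illustrative assumptions, not numbers from the post:

```python
# Back-of-envelope check of the ~1e28 FLOPs figure for a ~1M-GPU run.
# All inputs below are assumptions for illustration, not claims from the post.

gpus = 1e6                  # assumed fleet size (~1M accelerators)
peak_flops_per_gpu = 2e15   # assumed ~2e15 FLOP/s per GPU (H100-class at low precision)
utilization = 0.4           # assumed model FLOPs utilization (MFU)
run_days = 150              # assumed length of the training run, in days

total_flops = gpus * peak_flops_per_gpu * utilization * run_days * 86_400
print(f"{total_flops:.1e} FLOPs")  # ~1.0e28 under these assumptions
```

Under these (debatable) assumptions the total comes out around 1e28 FLOPs, so the figure is at least the right order of magnitude.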
The scaling laws for use of repeated data (Muennighoff et al., 2023, "Scaling Data-Constrained Language Models") estimate that repeating data 16 times is still useful. Chinchilla scaling laws estimate that a dense transformer with X parameters should use 20X tokens to make the best use of the requisite 6\*X\*20X FLOPs of compute. Notice that X is squared in the FLOPs estimate, so repeating data 16 times means the ability to make use of 256 times more FLOPs. Crunching the numbers, I get 2e29 FLOPs of compute for 50T tokens of training data (with even more effective compute). There's a filtered and deduplicated CommonCrawl dataset, RedPajama-Data-v2, with 30T tokens.
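A minimal sketch of that calculation, taking the Chinchilla rule D = 20N and C = 6ND at face value and treating 16 repetitions of 50T unique tokens as 800T effective tokens:

```python
# Chinchilla-style back-of-envelope: how much compute can 50T unique tokens
# support if repeating data 16 times is still useful (Muennighoff et al., 2023)?

unique_tokens = 50e12                        # 50T tokens of training data
repeats = 16                                 # repetitions assumed still useful
effective_tokens = unique_tokens * repeats   # D = 800T effective tokens

params = effective_tokens / 20               # Chinchilla-optimal N = D / 20
flops = 6 * params * effective_tokens        # C = 6 * N * D

print(f"N = {params:.1e} parameters, C = {flops:.1e} FLOPs")
# N = 4.0e13 parameters, C = 1.9e29 FLOPs
```

This reproduces the ~2e29 figure, and since C = 120·N² at the Chinchilla-optimal allocation, repeating data 16 times does buy a 16² = 256× increase in usable FLOPs.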
So we are good on data for the next few years. It's not high-quality data, but it's currently unknown whether it will nonetheless suffice. GPT-4 doesn't look like it can be scaffolded into something competent enough to be transformative. But going through another 3-4 OOMs of compute after GPT-4 is a new experiment that can reasonably be expected to yield either result.