Musings on Text Data Wall (Oct 2024)
The Chinchilla scaling law says that for a given number of FLOPs C, the optimal amount of data D to train on is proportional to C^b; that is, training for C FLOPs on either more or less data than that gives a less intelligent model. The optimal model size (number of parameters) N is then whatever requires C FLOPs when trained on D tokens of data, and C is usually about 6ND, so optimal N is proportional to C^(1-b). See Figure 4 in the paper. It turns out that b is close to 0.5, so N and D increase with C at a similar pace, and their ratio D/N (tokens per parameter) remains approximately the same across multiple orders of magnitude of available compute C (since D/N is proportional to C^(2b-1)).
In the Chinchilla paper, the exponent b is estimated as 0.50-0.54 using different methods[1] (see Table 3), and D/N at 3e21 FLOPs is about 20 tokens per parameter. If b is taken to be 0.50, then D/N doesn't change with compute. But if b is 0.54, then D/N increases proportionally to C^0.08. At the 2e25 FLOPs of original GPT-4, that results in a D/N of 40 tokens/parameter; at the 5e26 FLOPs of next year's models it becomes 52; and at the 7e27 FLOPs of late 2026 models from 1 gigawatt training systems it becomes 64 tokens/parameter. So even the original Chinchilla paper doesn't quite promise 20 tokens/parameter at 5e28 FLOPs under some of its estimates of b; you'd need to stick to the estimate of 0.50, and there isn't enough data to confidently make that call 7 orders of magnitude beyond the scale of the actual experiments, which only go up to 3e21 FLOPs. (Chinchilla itself is 6e23 FLOPs, 200 times more compute, where b of 0.54 gives a D/N of 30; but they went with b of 0.50.)
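To make the extrapolation explicit, here is a small Python sketch (mine, not from the paper) that anchors D/N at 20 tokens/parameter at 3e21 FLOPs and scales it as C^(2b-1):

```python
# D is proportional to C^b and C ~ 6*N*D, so D/N scales as C^(2b-1).
# Anchor: D/N of about 20 tokens/parameter at 3e21 FLOPs (the scale of the Chinchilla experiments).

def tokens_per_param(C, b, C0=3e21, dn0=20):
    """Extrapolate the optimal tokens-per-parameter ratio to compute C."""
    return dn0 * (C / C0) ** (2 * b - 1)

for C, label in [(2e25, "original GPT-4"), (5e26, "next year's models"), (7e27, "1 GW systems")]:
    print(f"{label:>18}: D/N ~ {tokens_per_param(C, b=0.54):.0f} at b=0.54, "
          f"{tokens_per_param(C, b=0.50):.0f} at b=0.50")
# b=0.54 gives roughly 40, 52, 65 (matching the numbers above up to rounding);
# b=0.50 keeps the ratio at 20 throughout.
```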
The picture of a steady 20 tokens/parameter is disrupted further by more recent papers. The Llama 3 report measures an exponent b of 0.53 with experiments of up to 1e22 FLOPs. At the 3.8e25 FLOPs of Llama-3-405B, this predicts a D/N ratio of 41 tokens per parameter (see Figures 2 and 3). Back at 3e21 FLOPs, it predicts a D/N ratio of 23[2], consistent with the Chinchilla experiments at that scale. For 5e26 FLOPs, it predicts a D/N of 48; for 7e27 FLOPs, a D/N of 56.
There is also the DeepSeek report, where the exponent b is estimated as 0.47 from experiments of up to 3e20 FLOPs (see Figure 4). This again predicts a D/N of about 20 at 3e21 FLOPs, even though the extrapolation to 4e25 FLOPs now promises a D/N of 12. The dataset is of course different, and an exponent b of 0.47 is consistent even with Chinchilla's original experiment on a GitHub dataset in Appendix C of that paper.
Finally, there is a recent Imbue blog post on CARBS, their hyperparameter optimization method (search for “per parameter”). The plot indicates an exponent b above 0.50, measured on training runs of up to 3e21 FLOPs (their GPUs are H100s, so assuming 40% utilization and dense BF16, that’s about 100 GPU-days). Across 100x of compute, D/N rises from about 12 to about 28, for an exponent b of about 0.59. This would predict D/N of 154 at 4e25 FLOPs.
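Backing the exponent out of that plot is a one-liner; a sketch using the eyeballed endpoints from above (12 and 28 tokens/parameter across 100x of compute):

```python
import math

# D/N scales as C^(2b-1); fit the exponent from two points read off the CARBS plot.
dn_low, dn_high, compute_ratio = 12, 28, 100
slope = math.log(dn_high / dn_low) / math.log(compute_ratio)  # 2b - 1, about 0.18
b = (slope + 1) / 2
print(f"b ~ {b:.2f}")  # ~ 0.59

# Extrapolating from D/N ~ 28 at 3e21 FLOPs out to 4e25 FLOPs:
print(f"D/N at 4e25 FLOPs ~ {dn_high * (4e25 / 3e21) ** slope:.0f}")
# ~ 160, matching the ~154 above up to rounding of the fitted exponent.
```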
Thus at higher compute the optimal tokens-per-parameter ratio remains highly uncertain and could be anywhere from 20 to 200. The isoFLOP curves seem to flatten slightly with more compute (Chinchilla plot, Llama 3 plot, DeepSeek plot), so the exact ratio might matter less there, and in a data-constrained regime that means we can get away with somewhat less data than optimal without significant degradation of performance, that is, with a lower D/N. But even at a D/N of 60 tokens per parameter, a 7e27 FLOPs run (500K B200s in BF16 for 6 months at 40% utilization) would need 260 trillion tokens (for a 4.4 trillion active parameter model).
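For the last claim, the arithmetic is just C = 6ND with D = r*N, so N = sqrt(C / (6r)); a minimal sketch:

```python
import math

def model_and_data(C, tokens_per_param):
    """Split a compute budget C (FLOPs) into model size N and data D using C = 6*N*D, D = r*N."""
    N = math.sqrt(C / (6 * tokens_per_param))
    return N, tokens_per_param * N

# 500K B200s at ~2.2e15 dense BF16 FLOP/s each, 40% utilization, 6 months:
C = 500_000 * 2.2e15 * 0.40 * (6 * 30 * 86400)
N, D = model_and_data(C, tokens_per_param=60)
print(f"C ~ {C:.1e} FLOPs, N ~ {N / 1e12:.1f}T active params, D ~ {D / 1e12:.0f}T tokens")
# ~ 7e27 FLOPs, ~ 4.4T parameters, ~ 260T tokens
```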
Training on Repeated Data
A May 2023 paper shows that data can be repeated many times when training language models. Repeating about 5 times seems to work about as well as having 5 times more unique data, and repeating about 15 times still works OK (see Figure 5). Then there are diminishing returns, and at 60 repetitions the results start getting worse with more data, not better (see Figure 3 and Figure 10 in Appendix E). In the context of Chinchilla scaling, the important caveat is that the optimal D/N ratio starts increasing when data is repeated, measuring at about 40 tokens per parameter at 60 repetitions and 5e18 FLOPs, and about 30 tokens/parameter at the more useful level of 20 repetitions (Figure 3 again). At this compute scale, that’s an unusually high tokens per parameter ratio, weakly suggesting that repeating data at higher compute might act as an additional factor of about 2-3x in increasing the optimal ratio.
Another interesting observation from that paper is that starting with a dataset, perplexity-filtering away half of it and then repeating the remaining half twice can give better results than training on the whole original dataset (Figure 6, right, yellow star in the top-middle). This weakly suggests that even when there are 250 trillion tokens to be found from crawling the web, it might be better to select 50 trillion tokens and repeat 5 times rather than to use all 250 trillion tokens once.
Cost of Reaching the Text Data Wall
Suppose there are 100 trillion tokens of natural text data to be found, which are more useful when kept rather than thrown away and replaced with repetitions of better tokens. Repeated 20 times, this gives 2e15 tokens. At a guessed D/N of 100, this uses 2.4e29 FLOPs "optimally", which at the data scaling exponent of 0.53 refines the D/N to 70 (extrapolating from the D/N of 41 at the 4e25 FLOPs of Llama-3-405B).
From doing 20 repetitions rather than only using the data once, the D/N might increase 2x, giving a final estimate of about 150 tokens/parameter at that scale, which with 2e15 tokens asks for a 13T active parameter model, for a total of 1.6e29 FLOPs, about 10,000x the compute of original GPT-4.
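Spelled out in a few lines (using the Llama 3 anchor and exponent from above; the 2x factor for repetition is the rough guess from the previous section, and 150 is the rounded-up result):

```python
# Reproducing the refinement of the D/N ratio at the data wall.
D_total = 100e12 * 20                      # 100T unique tokens repeated 20 times = 2e15 tokens
C_guess = 6 * (D_total / 100) * D_total    # at a guessed D/N of 100: ~2.4e29 FLOPs
dn_base = 41 * (C_guess / 3.8e25) ** 0.06  # Llama-3-style extrapolation: ~70 tokens/parameter
dn_final = 150                             # ~2x dn_base, rounded up, to account for 20 repetitions
N = D_total / dn_final                     # ~1.3e13 active parameters
C = 6 * N * D_total                        # ~1.6e29 FLOPs, about 10,000x the 2e25 of original GPT-4
print(f"dn_base ~ {dn_base:.0f}, N ~ {N:.1e}, C ~ {C:.1e}")
```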
An Nvidia B200 GPU, from which the training clusters of 2025 are going to be built, taken together with a corresponding fraction of the rest of the datacenter, consumes about 2 kilowatts and produces about 2.2e15 dense BF16 FLOP/s. To get 1.6e29 FLOPs at 30% utilization in 6 months, we’d need 15 million B200s. A training system of that scale would consume about 30 gigawatts and cost about $800 billion. This is of course not happening with B200s, and future GPUs will be more cost-efficient. But it’s also plausibly not too far away.
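And the hardware side of that estimate, under the per-GPU numbers assumed above:

```python
target_flops = 1.6e29
flops_per_gpu = 2.2e15       # dense BF16 FLOP/s per B200
utilization = 0.30
seconds = 6 * 30 * 86400     # 6 months
watts_per_gpu = 2000         # including its share of the rest of the datacenter

gpus = target_flops / (flops_per_gpu * utilization * seconds)
print(f"GPUs ~ {gpus / 1e6:.1f} million, power ~ {gpus * watts_per_gpu / 1e9:.0f} GW")
# ~ 15.6 million B200s, ~ 31 GW
```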
Small Models and the Data Wall
The data wall comes for small models first, because it's cheaper to train them on outrageous amounts of data, with a wildly suboptimal number of tokens per parameter. For some reason, publicly reported small models are not being trained on much more data than publicly reported Chinchilla-optimal models. With repetition of data, that doesn't even require preparing larger datasets for training.
For small models, knowledge distillation can be used to improve data efficiency. The logits from a larger teacher model (which give probability distributions over predicted tokens) are collected for the whole training dataset and then used as targets for prediction (instead of the exact tokens) when training smaller models. This is not cheap to do on the whole dataset with the largest model, but it's still several times cheaper than the training of the largest model, and cheaper still if the student is trained on repeated data, since the logits only need to be computed for one copy of the data. This was used in Gemma 2, where it seems to recover similar performance with 2x fewer tokens (see Table 7). It was also recently used for the 1B and 3B models of Llama-3.2, though there is no technical report and the teacher models are smaller than Llama-3-405B.
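As a concrete picture of what using teacher logits as targets looks like, here is a minimal sketch of soft-target distillation in PyTorch (a generic recipe, not the exact Gemma 2 or Llama-3.2 setup; the temperature is illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over next-token distributions, averaged per token.

    Both logit tensors have shape [batch, seq_len, vocab_size]; the teacher logits are
    precomputed over the dataset (in practice often truncated to top-k and renormalized).
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1).flatten(0, -2)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1).flatten(0, -2)
    # 'batchmean' averages over tokens after flattening; the t^2 factor keeps gradient
    # scale comparable across temperatures (the usual convention for distillation).
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```

During training this loss replaces (or is mixed with) the usual cross-entropy on the exact next token.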
With small models, all 60 repetitions of data might be reasonable (the point where results start getting worse with more repetitions, however many that actually turns out to be at the relevant scale), and with knowledge distillation they might count for more. For a 1B model, training on the 16 trillion tokens of Llama 3 repeated 60 times costs 6e24 FLOPs, about the same as preparing the dataset using Llama-3-405B as the teacher for knowledge distillation. For a 9B model, training on 100 trillion tokens repeated 60 times costs about 3e26 FLOPs, which one of the 100K H100 clusters already available in 2024 can process in 3 months[3].
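The arithmetic behind those figures, as a quick check (taking an H100 at ~1e15 dense BF16 FLOP/s and 40% utilization):

```python
def train_flops(params, tokens, repeats=1):
    """Training compute via the usual C ~ 6 * N * D approximation."""
    return 6 * params * tokens * repeats

print(f"1B model, 16T tokens x 60:  {train_flops(1e9, 16e12, 60):.1e} FLOPs")   # ~6e24
print(f"9B model, 100T tokens x 60: {train_flops(9e9, 100e12, 60):.1e} FLOPs")  # ~3e26

cluster = 100_000 * 1e15 * 0.40 * (3 * 30 * 86400)  # 100K H100s, 40% utilization, 3 months
print(f"100K H100s for 3 months:    {cluster:.1e} FLOPs")                       # ~3e26
```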
Beyond 60 repetitions of data in small-scale experiments, loss gets worse up to 200 repetitions, but then there is double descent and it starts getting better again (see Figure 9 in Appendix D of the Scaling Data-Constrained Language Models paper). So it might be the case that there is further improvement after hundreds of repetitions, or this effect might appear at greater scale where models are capable of learning more general circuits.
[1] There is also a measurement of b on an alternative dataset not intended to train Chinchilla, which gives a b of 0.47; see Table A2 in Appendix C.
[2] One dot on Figure 3 is at 3e21 FLOPs and 1e11 tokens, so 5e9 parameters, exactly 20 tokens/parameter. This is an observation from the 3e21 isoFLOP curve on Figure 2.
[3] The minibatch size won't be reasonable if done straightforwardly, but it might be possible to resolve this issue by doing something like DiLoCo, with inner Adam optimizers at smaller batch sizes and an outer optimizer with Nesterov momentum.
Synthetically enhancing and/or generating data could be another dimension of scaling. Imagine how much deeper understanding a person/LLM would have if, instead of simply reading/training on a source like the Bible N times, they had to annotate it into something more like the Oxford Annotated Bible, and that whole process of annotation became training data.
Some thoughts
Here is a method that helps avoid overfitting/memorization when training multiple times on the same data: https://arxiv.org/abs/2406.10209
Certain text sources are really information-rich and could benefit from unpacking. In particular, academic papers could be richly annotated with auto-generated details like relevant quotes from each of the cited papers, LLM-generated comprehension questions and answers, or verbal descriptions and analysis of the figures. I think that'd stretch the data a lot in high-quality areas.
There's been progress on novel algorithms/architectures that trade off training efficiency for saturation/forgetting resistance, e.g. KANformers or mixtures of millions of experts. That could make a small model stretch a lot further.