Data is running out for making overtrained models, not Chinchilla-optimal models, because you can repeat data (there’s also a recent hour-long presentation by one of the authors). This systematic study was published only in May 2023, though the Galactica paper from Nov 2022 also has a result to this effect (see Figure 6). The preceding popular wisdom was that you shouldn’t repeat data for language models, so cached thoughts that don’t take this result into account are still plentiful. Also, repetition doesn’t sufficiently rescue highly overtrained models, so the underlying concern still has some merit.
As you repeat data more and more, the Chinchilla multiplier of data to parameters (data in tokens divided by number of active parameters at the optimal use of given compute) gradually increases from 20 to 60 (see the data-constrained efficient frontier curve in Figure 5, which tilts lower on the parameters/data plot, deviating from the Chinchilla efficient frontier line for data without repetition). You can repeat data essentially without penalty about 4 times, still efficiently about 16 times, and with any benefit at all about 60 times (past that point, even increasing parameters while keeping data fixed starts decreasing rather than increasing performance). This enables productive use of up to ~100x more compute than the Chinchilla-optimal use of non-repeated data, while retaining some efficiency (at 16x repetition of data), or up to ~1200x more compute for the marginally useful 60x repetition.
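The ~100x and ~1200x figures follow from the standard compute approximation C ≈ 6·N·D together with the shifting tokens-to-parameters ratio; here is a back-of-envelope sketch (my own reconstruction of the arithmetic, not code from the paper, and the ratio values are the rough 20 → 60 endpoints from the text):

```python
# Rough sketch of where the ~100x and ~1200x compute multipliers come from,
# assuming Chinchilla-style compute C ~ 6 * params * total_tokens and an
# optimal tokens-to-parameters ratio that drifts from 20 (no repetition)
# to 60 (heavy repetition). My reconstruction, not code from the paper.

def compute_multiplier(repetition, ratio_repeated, ratio_base=20):
    """Compute at the data-constrained optimum, relative to the
    no-repetition Chinchilla optimum on the same unique data."""
    # With unique data D, total tokens seen = repetition * D and
    # params = total / ratio, so C = 6 * total**2 / ratio.
    # Dividing by the base case C = 6 * D**2 / ratio_base gives:
    return repetition**2 * ratio_base / ratio_repeated

print(compute_multiplier(16, 60))  # ~85, i.e. roughly the "100x" figure
print(compute_multiplier(60, 60))  # 1200.0, the 60x-repetition ceiling
```

The prose's "100x" is this ~85x rounded up; the 1200x figure comes out exactly.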
The datasets you currently see at the 15-30T token scale are still highly filtered compared to available raw data (see Figure 4). The scale feasible within a few years is about 2e28-1e29 FLOPs (accounting for hypothetical hardware improvements and the larger datacenters of the early 2030s; this is physical, not effective, compute). Chinchilla-optimal compute for a 50T token dataset is about 8e26 FLOPs, which turns into roughly 8e28 FLOPs with 16x repetition of data, and up to 9e29 FLOPs for the barely useful 60x repetition. Note that it's sometimes better to perplexity-filter away half of a dataset and repeat the remainder twice than to use the whole original dataset (yellow star in Figure 6; discussion in the presentation), so heavy repetition of 50T tokens might still outperform lighter repetition of less-filtered data; in other words, finding 100T tokens by filtering less doesn't necessarily work at all. There's also some double descent for repetition (Appendix D; discussion in the presentation), which suggests it might be possible to overcome the 60x repetition barrier (Appendix E) with sufficient compute or better algorithms.
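As a sanity check on those FLOP figures, a minimal sketch under the same assumptions (C ≈ 6·N·D, with a tokens-to-parameters ratio of 20 without repetition drifting to 60 with heavy repetition; this is my arithmetic, not the paper's):

```python
# Back-of-envelope FLOPs for training on a 50T-token dataset, assuming
# compute C ~ 6 * params * total_tokens and an optimal tokens-to-parameters
# ratio of 20 (no repetition) to 60 (heavy repetition). Rough assumptions
# of mine, not exact numbers from the paper.

def data_constrained_flops(unique_tokens, repetition=1, ratio=20):
    total = unique_tokens * repetition   # total tokens seen in training
    params = total / ratio               # optimal parameter count
    return 6 * params * total

D = 50e12  # 50T unique tokens
print(f"{data_constrained_flops(D):.1e}")          # 7.5e+26, i.e. ~8e26 FLOPs
print(f"{data_constrained_flops(D, 16, 60):.1e}")  # 6.4e+28, i.e. ~8e28 FLOPs
print(f"{data_constrained_flops(D, 60, 60):.1e}")  # 9.0e+29 FLOPs
```

The small gap between 6.4e28 here and 8e28 in the text is just the rounding of the ~85x multiplier up to 100x.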
In any case, the OOMs match between what repeated data allows and the compute that's plausibly available in the near future (4-8 years). There's also probably a significant amount of data to be found that's not on the web, and every 2x increase in unique reasonable-quality data means a 4x increase in usable compute (since compute scales with the square of data at a fixed tokens-to-parameters ratio). Where data gets truly scarce soon is for highly overtrained inference-efficient models.
I agree that repeated training will change the picture somewhat. One thing I find quite nice about the linked Epoch paper is that its estimated range of available tokens spans an order of magnitude, and even though many people have ideas for getting more data (a common suggestion is to use private platform data such as messaging apps), most of these don't change the picture because they don't move things by more than an order of magnitude, and the scaling trends want more orders of magnitude, not merely 2x.
Repeated data is the type of thing that plausibly adds an order of magnitude or maybe more.
The point is that you need to get quantitative in these estimates to claim that data is running out, since it has to run out compared to available compute, not merely on its own. And the repeated data argument seems by itself sufficient to show that it doesn’t in fact run out in this sense.
Data still seems to be running out for overtrained models, which is a major concern for LLM labs, so from their point of view there is indeed a salient data wall that's very soon going to become a problem. There are rumors of synthetic data (which often ambiguously gesture at post-training results while discussing the pre-training data wall), but no published research on how something like that improves the pre-training situation over using repeated data.