the data bottleneck that threatens to strangle scaling
There is no data bottleneck (for data that's not necessarily high quality), because data can be repeated in training: about 4 times with little difference compared to training on unique data, and up to about 16 times while still significantly improving the model. This was notably used in Galactica (see Figure 6), published Nov 2022; then came the systematic study of scaling laws for repeated data in May 2023; and recently repeated data was applied in StarCoder 2 (Feb 2024).
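A rough sketch of why repetition is nearly free at ~4 epochs but shows diminishing returns by ~16: the May 2023 study fits an exponential decay for how much a repeated token is "worth" relative to a fresh one. The functional form below follows that paper's shape, but the decay constant here is a placeholder assumption, not the fitted value; check the paper for the actual fit.

```python
# Illustrative sketch (not the paper's exact fit): model the "effective"
# unique-token value of repeated data as decaying exponentially with the
# number of repeats. r_star is an assumed placeholder constant.
import math

def effective_tokens(unique_tokens: float, repeats: int, r_star: float = 15.0) -> float:
    """Effective unique-token equivalent of training on `unique_tokens`
    repeated `repeats` extra times (0 repeats = each token seen once)."""
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

unique = 30e12  # 30T unique tokens
for epochs in (1, 4, 16):
    eff = effective_tokens(unique, epochs - 1)
    naive = unique * epochs
    print(f"{epochs:>2} epochs: {eff / 1e12:6.1f}T effective tokens "
          f"({eff / naive:.0%} of the naive {naive / 1e12:.0f}T)")
```

With these illustrative numbers, 4 epochs retain over 90% of the naive token count while 16 epochs retain roughly two thirds, which matches the qualitative picture of "nearly free up to ~4x, still worthwhile up to ~16x".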
A Chinchilla optimal model uses a model size proportional to dataset size, meaning compute is proportional to data squared. If you repeat data 16 times, this means finding a use for 256 times more compute. RedPajama-Data-v2, a filtered and deduplicated CommonCrawl text dataset, has 30 trillion tokens. Repeated 16 times with a Chinchilla optimal monolithic Transformer, it would use about 7e28 FLOPs of compute, and the requirement keeps scaling with data squared if more data can be found, which it certainly can, even if not OOMs more. Assuming BF16 training at 30% utilization, this would require 3.2e10 H100-hours, which at $2/hour comes to about $65 billion. Anchoring instead to the rumored 2e25 FLOPs GPT-4 run at $100 million, this gives $350 billion. Both numbers are likely currently outside commercial feasibility if smaller models fail to demonstrate sufficiently impressive feats. And there's still that further quadratic scaling of needed compute with more data than 30 trillion tokens. (Though Microscaling in Blackwell might reduce the cost of effective compute more than could otherwise be expected this soon.)
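For concreteness, a minimal script reproducing the arithmetic above. Every constant is an assumption taken from the surrounding paragraph rather than a measured value: the ~20 tokens-per-parameter Chinchilla ratio, compute ≈ 6·N·D, the ~2e15 FLOP/s per-H100 BF16 peak implied by the quoted figures, $2 per H100-hour, and the rumored GPT-4 numbers used as an anchor.

```python
# Back-of-the-envelope reproduction of the estimates in the text.
tokens = 30e12 * 16                 # RedPajama-Data-v2 repeated 16x -> 480T tokens
params = tokens / 20                # Chinchilla-optimal: params proportional to tokens
flops  = 6 * params * tokens        # ~6.9e28 FLOPs (~7e28)

h100_peak   = 2e15                  # FLOP/s, assumed BF16 peak implied by the text
utilization = 0.30
gpu_hours   = flops / (h100_peak * utilization * 3600)  # ~3.2e10 H100-hours
cost_rental = gpu_hours * 2                             # ~$65B at $2/hour

cost_anchor = flops * (100e6 / 2e25)                    # ~$350B, scaled from GPT-4

print(f"compute:       {flops:.1e} FLOPs")
print(f"H100-hours:    {gpu_hours:.1e}")
print(f"rental cost:   ${cost_rental / 1e9:.0f}B")
print(f"anchored cost: ${cost_anchor / 1e9:.0f}B")
```

Doubling the unique data under the same assumptions would quadruple the FLOPs and both cost estimates, which is the quadratic scaling referred to above.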