LLMs will soon scale beyond the available natural text data, and generation of synthetic data amounts to a change of architecture, potentially a completely different source of capabilities. So scaling LLMs much further without a change of architecture is an expectation about something counterfactual. It makes sense as a matter of theory, but it’s not relevant for forecasting.
Edit 15 Dec: No longer endorsed, based on scaling laws for training on repeated data.
Bold claim. Want to make any concrete predictions so that I can register my different beliefs?
I’ve now changed my mind based on N. Muennighoff et al. (2023), Scaling Data-Constrained Language Models.
The main result is that up to 4 repetitions are about as good as unique data, and for up to about 16 repetitions there is still meaningful improvement. Let’s take 50T tokens as an estimate for available text data (as an anchor, there’s a filtered and deduplicated CommonCrawl dataset, RedPajama-Data-v2, with 30T tokens). Repeated 4 times, it can make good use of 1e28 FLOPs (with a dense transformer), and repeated 16 times, suboptimal but still meaningful use of 2e29 FLOPs. So this is close to, but not lower than, what can be put to use within the next few years. Thanks for pushing back on the original claim.
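As a rough sanity check on these numbers, here’s a back-of-the-envelope sketch assuming the common Chinchilla approximations C ≈ 6ND and D ≈ 20N (so C ≈ 0.3·D²); the coefficients are my assumption, not figures taken from the paper.

```python
# Back-of-the-envelope Chinchilla arithmetic (assumed approximations:
# training compute C ~ 6*N*D, compute-optimal tokens D ~ 20*N, so C ~ 0.3*D^2).

def chinchilla_compute(tokens: float) -> float:
    """Compute (FLOPs) that a given token budget can absorb compute-optimally."""
    return 0.3 * tokens ** 2

unique_tokens = 50e12  # 50T tokens of available text (estimate from above)

for repeats in (4, 16):
    effective_tokens = unique_tokens * repeats
    flops = chinchilla_compute(effective_tokens)
    print(f"{repeats:>2} repetitions: {effective_tokens:.0e} tokens ~ {flops:.1e} FLOPs")

# Output:
#  4 repetitions: 2e+14 tokens ~ 1.2e+28 FLOPs
# 16 repetitions: 8e+14 tokens ~ 1.9e+29 FLOPs
```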
Three points: how much compute is going into a training run, how much natural text data it wants, and how much data is available. For training compute, there are claims of multi-billion dollar runs being plausible and possibly planned in 2-5 years. Eyeballing various trends and GPU shipping numbers and revenues, it looks like about 3 OOMs of compute scaling is possible before industrial capacity constrains the trend and the scaling slows down. This assumes that there are no overly dramatic profits from AI (which might lead to finding ways of scaling supply chains faster than usual), and no overly dramatic lack of new capabilities with further scaling (which would slow down investment in scaling). That gives about 1e28-1e29 FLOPs at the slowdown in 4-6 years.
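To make the arithmetic behind that range explicit: the implicit baseline, which is my assumption rather than something stated above, is that current frontier training runs are on the order of 1e25-1e26 FLOPs, and three orders of magnitude on top of that lands at 1e28-1e29 FLOPs.

```python
# Rough projection: ~3 OOMs of compute scaling on top of an assumed current
# frontier of ~1e25-1e26 FLOPs per training run (baseline is an assumption).

baseline_flops = (1e25, 1e26)   # assumed current frontier training runs
scaling_ooms = 3                # ~3 orders of magnitude before the slowdown

low, high = (b * 10 ** scaling_ooms for b in baseline_flops)
print(f"Projected frontier at the slowdown: {low:.0e} - {high:.0e} FLOPs")
# Projected frontier at the slowdown: 1e+28 - 1e+29 FLOPs
```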
At 1e28 FLOPs, Chinchilla scaling asks for 200T-250T tokens. Various sparsity techniques increase effective compute, asking for even more tokens (when optimizing loss given fixed hardware compute).
Edit 15 Dec: I no longer endorse this point, based on scaling laws for training on repeated data.
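Running the same assumed Chinchilla-style arithmetic in the other direction shows where a figure like this can come from; the tokens-per-parameter ratio range below is my assumption, and the quoted 200T-250T presumably reflects slightly different fit constants.

```python
import math

# Invert the assumed approximations C ~ 6*N*D with D ~ r*N
# (r = tokens-per-parameter ratio), giving D = sqrt(r * C / 6).

compute = 1e28  # FLOPs
for ratio in (20, 25, 30):
    tokens = math.sqrt(ratio * compute / 6)
    print(f"r = {ratio}: ~{tokens / 1e12:.0f}T tokens")

# r = 20: ~183T tokens
# r = 25: ~204T tokens
# r = 30: ~224T tokens
```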
On the outside, there are 20M-150M accessible books, some text from video, and 1T web pages of extremely dubious uniqueness and quality. That might give about 100T tokens, if LLMs are used to curate? There’s some discussion (incl. comments) here; this is the figure I’m most uncertain about. In practice, absent good synthetic data, I expect multimodality to fill the gap, but that’s not going to be as useful as good text for improving chatbot competence. (Possibly the issue with the original claim in the grandparent is what I meant by “soon”.)
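For what it’s worth, here’s one way such a figure could be decomposed; every per-item constant below is an assumption I’m making for illustration, not something claimed above, which is exactly why the total is so uncertain.

```python
# Illustrative decomposition of a ~100T-token supply estimate.
# All per-item token counts are assumptions for illustration only;
# text from video is omitted.

tokens_per_book = 80e3            # assumed average tokens per book
useful_tokens_per_page = 100      # assumed tokens per web page surviving curation

web_pages = 1e12                  # ~1T web pages
web_tokens = web_pages * useful_tokens_per_page

for n_books in (20e6, 150e6):     # 20M-150M accessible books
    book_tokens = n_books * tokens_per_book
    total = book_tokens + web_tokens
    print(f"{n_books/1e6:.0f}M books: {book_tokens/1e12:.0f}T (books) "
          f"+ {web_tokens/1e12:.0f}T (web) ≈ {total/1e12:.0f}T tokens")

# 20M books: 2T (books) + 100T (web) ≈ 102T tokens
# 150M books: 12T (books) + 100T (web) ≈ 112T tokens
```

With these assumptions the total is almost entirely driven by the per-page yield after curation, which is presumably why this is the most uncertain figure in the estimate.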