Three points: how much compute is going into a training run,
how much natural text data it wants, and how much data is available.
For training compute, there are claims of multi-billion dollar runs being
plausible and possibly planned in 2-5 years.
Eyeballing various trends, GPU shipment numbers, and revenues,
it looks like about 3 OOMs of compute scaling is possible
before industrial capacity constrains the trend and the scaling slows down.
This assumes that there are no overly dramatic profits from AI
(which might lead to finding ways of scaling supply chains faster than usual),
and no overly dramatic lack of new capabilities with further scaling
(which would slow down investment in scaling).
That gives about 1e28-1e29 FLOPs at the slowdown in 4-6 years.
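To make the arithmetic explicit, a minimal sketch (the ~1e25-1e26 FLOPs baseline for today's frontier runs is my assumption, implied by the 3 OOMs, not a number claimed above):

```python
# Rough arithmetic behind the 1e28-1e29 FLOPs figure.
# Assumption: today's largest training runs sit around 1e25-1e26 FLOPs.
current_frontier_flops = (1e25, 1e26)
scaling_ooms = 3  # eyeballed headroom before industrial capacity binds

slowdown_flops = tuple(c * 10**scaling_ooms for c in current_frontier_flops)
print(slowdown_flops)  # (1e+28, 1e+29)
```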
At 1e28 FLOPs, Chinchilla scaling asks for 200T-250T tokens.
Various sparsity techniques increase effective compute,
asking for even more tokens
(when optimizing loss given fixed hardware compute).
Edit 15 Dec: I no longer endorse this point, based on scaling laws for training on repeated data.
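For the token figure, a quick Chinchilla back-of-envelope (a sketch using the usual C ≈ 6ND cost model and ~20 tokens per parameter; a ratio closer to 25 lands in the 200T-250T range quoted above):

```python
import math

# Chinchilla back-of-envelope: C ~= 6*N*D with D ~= ratio*N
# => N = sqrt(C / (6*ratio)), D = ratio * N
def chinchilla_tokens(compute_flops, tokens_per_param=20.0):
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return tokens_per_param * n_params

print(f"{chinchilla_tokens(1e28):.1e}")                       # ~1.8e+14 (~180T tokens)
print(f"{chinchilla_tokens(1e28, tokens_per_param=25):.1e}")  # ~2.0e+14 (~200T tokens)
print(f"{chinchilla_tokens(1e29):.1e}")                       # ~5.8e+14 (~580T tokens)
```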
On the outside, there are 20M-150M accessible books,
some text from video,
and 1T web pages of extremely dubious uniqueness and quality.
That might give about 100T tokens, if LLMs are used to curate?
There’s some discussion (incl. comments) here;
this is the figure I’m most uncertain about.
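A rough supply-side sketch of how you might get to ~100T (every per-item figure below is an illustrative assumption of mine, not an established number):

```python
# Supply-side sketch. All per-item figures are assumptions for illustration only.
books = 150e6            # upper end of the 20M-150M accessible books
tokens_per_book = 100e3  # assumed average book length in tokens
book_tokens = books * tokens_per_book  # ~1.5e13, i.e. ~15T

web_pages = 1e12         # the "1T web pages" of dubious uniqueness and quality
usable_fraction = 0.1    # assumed share surviving dedup + LLM curation
tokens_per_page = 1e3    # assumed tokens per surviving page
web_tokens = web_pages * usable_fraction * tokens_per_page  # ~1e14, i.e. ~100T

print(f"books ~{book_tokens:.1e} tokens, curated web ~{web_tokens:.1e} tokens")
```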
In practice, absent good synthetic data,
I expect multimodality to fill the gap,
but that’s not going to be as useful as good text
for improving chatbot competence.
(Possibly the issue with the original claim in the grandparent
is what I meant by “soon”.)