Text data is running out; the $2+ billion scale training runs due in 2-4 years are going to devour the rest of it. That might be sufficient to reach AGI, in the sense of capability for mostly autonomous research, in particular development of compute multipliers for training runs and plucking the rest of the low-hanging fruit of the unsupervised learning revolution, overcoming the scarcity of hardware.
If AGI is not in range of those runs, and if there is no synthetic data generation process useful at that scale, the bulk of compute goes to multimodality (though latency will still cripple many use cases), and the rate of competence improvement may slow for years. This is the main scenario where I see significant hope for regulation to take hold. Doing better, and countering the risk of AGI in the initial rush to billion-dollar-scale runs, would require a nebulously defined pause right now.
One way to get more data is to pay humans to create the specific types of data we need. For example, if a billion people each write 100 pages on the unique topic of their expertise (with the needed data generation directed by AI), maybe that will be enough.
I’m somewhat skeptical that running out of text data will meaningfully slow progress. Today’s models are so sample-inefficient compared with human brains that I suspect significant jumps are possible there.
Also, as you say:
- Synthetic text data might well be possible (especially for domains where you can test the quality of the produced text externally, e.g. programming).
- Reinforcement-learning-style virtual environments can also generate data (and not necessarily only physics-based environments either; it could be more like playing games or using a computer).
- And multimodal inputs give us a lot more data too, and I think we’ve only really scratched the surface of multimodal transformers today.
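For the externally-checkable case (programming), the core loop is just "generate, execute, keep only what passes." A minimal sketch, where the candidate string, the `solution` function name, and the test pairs are all made-up illustrations rather than any real pipeline:

```python
# Sketch: filtering synthetic programming data by external verification.
# The candidate code below stands in for a model-generated sample; only
# samples that pass all tests would enter the training set.

def passes_tests(code: str, test_cases: list[tuple[int, int]]) -> bool:
    """Execute candidate code and keep it only if it passes every test."""
    namespace: dict = {}
    try:
        exec(code, namespace)            # run the generated definition
        f = namespace["solution"]        # assumed entry-point name
        return all(f(x) == y for x, y in test_cases)
    except Exception:
        return False                     # broken candidates are discarded

# A toy "generated" candidate and its verifying tests:
candidate = "def solution(n):\n    return n * n"
tests = [(2, 4), (3, 9), (10, 100)]

keep = passes_tests(candidate, tests)    # True: sample survives the filter
```

The point is that the verifier, not the generator, supplies the quality signal, which is why code (and to some extent math) is the domain where synthetic data looks most plausible.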
New untested ideas take unpredictable time to develop. Given the current timeline of pure compute/investment scaling, there is no particular reason for all bottlenecks to be cleared just in time for scaling to continue without slowing down. Hence the possibility of scaling slowing down at the upcoming bottlenecks of natural text data and available-on-short-notice hardware, which arrive somewhat close together.
Sample efficiency (with respect to natural data) can in principle be improved; humans and some RL systems show it’s possible, and synthetic data is a particular form this improvement might take. But it’s not something that’s readily available, or known to subsume the capabilities of LLMs and scale past them. Also, straying further from the LLM recipe of simulating human text might make alignment even more intractable. In a universe where alignment of LLMs is feasible within the current breakneck regime, the source of doom I worry about is an RL system that either didn’t train on human culture or did too much reflection to remain within its frame.
Compared to natural text, multimodal data and many recipes for synthetic data give something less valuable for improving model competence, reducing return on further scaling. When competence improvement slows down, and if AGI in the sense of human-level autonomous work remains sufficiently far away at that point, investment scaling is going to slow down as well. Future frontier models cost too much if there is no commensurate competence improvement.
My hunch is that there’s sufficient text already if an AI processes it more reflectively. For example, each chunk of text can be fed through a series of LLM prompts intended to enrich it, and then the model trains on the enriched/expanded text.
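That enrichment loop can be sketched in a few lines. `ask_llm` below is a hypothetical stand-in for whatever completion API is available, and the three prompts are illustrative, not a tested recipe:

```python
# Sketch of the enrichment idea: run each chunk of source text through a
# series of reflective prompts and train on the expanded result.

ENRICHMENT_PROMPTS = [
    "Summarize the key claims of the following passage:\n\n{chunk}",
    "List questions this passage raises and answer them:\n\n{chunk}",
    "Restate the passage's argument step by step:\n\n{chunk}",
]

def enrich(chunk: str, ask_llm) -> str:
    """Return the original chunk plus several reflective expansions of it."""
    expansions = [ask_llm(p.format(chunk=chunk)) for p in ENRICHMENT_PROMPTS]
    return "\n\n".join([chunk, *expansions])

# Usage with a dummy model that just echoes, to show the data flow:
dummy = lambda prompt: f"[model output for: {prompt[:30]}...]"
enriched = enrich("Water boils at 100 C at sea level.", dummy)
# `enriched` is the original chunk followed by three generated expansions.
```

The appeal is that each pass through the corpus multiplies the effective token count; the open question is whether training on a model’s own expansions adds competence or just recycles it.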