A lot of this argument seems to rest on the training-inference gap, allowing a very large population of AIs to exist at the same as cost as training. In that way they can be a formidable group even if the individual AIs are only human-level. I was suspicious of this at first, but I found myself largely coming round to it after sanity checking it using a slightly different method than biological anchors. However, if I understand correctly the biological anchors framework implies the gap between training and inference grows with capabilities. My projection instead expects it to grow a little in the next few years and then plateau as we hit the limits of data scaling. This suggests a more continuous picture: there will be a “population explosion” of AI systems in the next few years so to speak as we scale data, but then the “population size” (total number of tokens you can generate for your training budget) will stay more or less constant, while the quality of the generated tokens gradually increases.
To a first approximation, the amount of inference you can do at the same cost as training the system will equal the size of the training data multiplied by number of epochs. The trend in large language models seems to be to train for only 1 epoch on most data, and a handful of epochs for the highest-quality parts of the data. So as a rule of thumb: if you spend $X on training and $X on inference, you can produce as much data as your training dataset. Caveat: inference can be more expensive (e.g. beam search) or less expensive (e.g. distillation, specialized inference-only hardware) and depends on things like how much you care about latency; I think this only changes the picture by 10x either way.
Given that GPT-3 was trained on a significant fraction of the entire text available on the Internet (CommonCrawl), this would already be a really big deal if GPT-3 was actually close to human-level. Adding another Internets-worth of content would be… significant.
But conversely, the fact we’re already training on so much data limits how much room for growth there is. I’d estimate we have no more than 100-1000x left for language scaling. We could probably get up to 10x more from more comprehensive (but lower quality) crawls than CommonCrawl, and 10-100x more if tech companies use non-public datasets (e.g. all e-mails & docs on a cloud service).
By contrast, in principle compute could scale up a lot more than this. We can likely get 10-100x just from spending more on training runs. Hardware progress could easily deliver 1000x by 2036, the date chosen in this post.
Given this, at least under business as usual scaling I expect us to hit the limits of data scaling significantly before we exhaust compute scaling. So we’ll have larger and more compute-intensive models trained on relatively small datasets (although still massive in absolute terms). This suggests the training-inference gap grow a bit as we grow training data size, but soon plateau as we just scale up model size while keeping training data fixed.
One thing that could do undo this argument is if we end up training for many (say >10) epochs, or synthetically generate data, as a kind of poor-mans data scaling rather than just scaling up parameter count. I expect we’ll try this, but I’d only give it 30% odds it makes a big difference. I do think it’s more likely if we move away from the LM paradigm, and either get a lot of mileage out of multi-modal models (there’s lots more video data at least in terms of GB, maybe not in terms of abstract information content) or back towards RL (where data generated in simulation seems much more valuable and scalable).
Isn’t multi-epoch training most likely to lead to overfitting, making the models less useful/powerful?
If it were possible to write an algorithm to generate this synthetic training data how would the resulting training data have any more information content than the algorithm that produced it? Sure, you’d get an enormous increase in training text volume, but large volumes of training data containing small amounts of information seems counterproductive for training purposes—it will just bias the model disproportionately toward that small amount of information.
A lot of this argument seems to rest on the training-inference gap, allowing a very large population of AIs to exist at the same as cost as training. In that way they can be a formidable group even if the individual AIs are only human-level. I was suspicious of this at first, but I found myself largely coming round to it after sanity checking it using a slightly different method than biological anchors. However, if I understand correctly the biological anchors framework implies the gap between training and inference grows with capabilities. My projection instead expects it to grow a little in the next few years and then plateau as we hit the limits of data scaling. This suggests a more continuous picture: there will be a “population explosion” of AI systems in the next few years so to speak as we scale data, but then the “population size” (total number of tokens you can generate for your training budget) will stay more or less constant, while the quality of the generated tokens gradually increases.
To a first approximation, the amount of inference you can do at the same cost as training the system will equal the size of the training data multiplied by number of epochs. The trend in large language models seems to be to train for only 1 epoch on most data, and a handful of epochs for the highest-quality parts of the data. So as a rule of thumb: if you spend $X on training and $X on inference, you can produce as much data as your training dataset. Caveat: inference can be more expensive (e.g. beam search) or less expensive (e.g. distillation, specialized inference-only hardware) and depends on things like how much you care about latency; I think this only changes the picture by 10x either way.
Given that GPT-3 was trained on a significant fraction of the entire text available on the Internet (CommonCrawl), this would already be a really big deal if GPT-3 was actually close to human-level. Adding another Internets-worth of content would be… significant.
But conversely, the fact we’re already training on so much data limits how much room for growth there is. I’d estimate we have no more than 100-1000x left for language scaling. We could probably get up to 10x more from more comprehensive (but lower quality) crawls than CommonCrawl, and 10-100x more if tech companies use non-public datasets (e.g. all e-mails & docs on a cloud service).
By contrast, in principle compute could scale up a lot more than this. We can likely get 10-100x just from spending more on training runs. Hardware progress could easily deliver 1000x by 2036, the date chosen in this post.
Given this, at least under business as usual scaling I expect us to hit the limits of data scaling significantly before we exhaust compute scaling. So we’ll have larger and more compute-intensive models trained on relatively small datasets (although still massive in absolute terms). This suggests the training-inference gap grow a bit as we grow training data size, but soon plateau as we just scale up model size while keeping training data fixed.
One thing that could do undo this argument is if we end up training for many (say >10) epochs, or synthetically generate data, as a kind of poor-mans data scaling rather than just scaling up parameter count. I expect we’ll try this, but I’d only give it 30% odds it makes a big difference. I do think it’s more likely if we move away from the LM paradigm, and either get a lot of mileage out of multi-modal models (there’s lots more video data at least in terms of GB, maybe not in terms of abstract information content) or back towards RL (where data generated in simulation seems much more valuable and scalable).
Isn’t multi-epoch training most likely to lead to overfitting, making the models less useful/powerful?
If it were possible to write an algorithm to generate this synthetic training data how would the resulting training data have any more information content than the algorithm that produced it? Sure, you’d get an enormous increase in training text volume, but large volumes of training data containing small amounts of information seems counterproductive for training purposes—it will just bias the model disproportionately toward that small amount of information.