Yes, good point, Josh. If the biggest labs had been pushing as fast as possible, they could have had a next model by now.
I don’t have a definite answer to this, but I have some guesses.
It could be a combination of any of these.
Keeping up with inference demand, as Josh mentioned
Wanting to focus on things other than getting the next big model out ASAP:
multimodality (e.g. GPT-4o), better versions of cheaper smaller models (e.g. Sonnet 3.5, Gemini Flash), and non-capabilities work like safety or watermarking
choosing to put more time and effort into improving the data / code / training process that will be used for the next large model run. This could include: smaller-scale experiments to test ideas, cleaning data, improving synthetic data generation (Strawberry?), gathering new data to cover specific weak spots (perhaps by paying people to create it), and developing and testing better engineering infrastructure to support larger runs
wanting to spend extra time evaluating the performance of checkpoints partway through training to make sure everything is working as expected. Larger scale means mistakes are much more costly, and mistakes caught early in a training run are far cheaper to fix than mistakes caught at the end. (A rough sketch of what this kind of periodic checkpoint evaluation might look like is at the end of this comment.)
wanting to spend more time and effort evaluating the final product. There were several months where GPT-4 existed internally and got tested in a bunch of different ways; Nathan Labenz tells interesting stories of his time as a pre-release tester. Hopefully, with the new, larger generation of models, the companies will spend even more time and effort evaluating the new capabilities. If they scaled up their evaluation time from 6-8 months to 12-18 months, then we'd expect that much additional delay. We would only see a next-gen model publicly right now if they had started on it ASAP and then completely skipped the safety testing. I really hope no companies choose to skip safety testing!
if safety and quality testing is done (as I expect it will be), then any flaws found could require additional time and effort to correct. I would expect multiple test-and-fine-tune cycles before the final product is deemed suitable for release.
even after the product is deemed ready, there may be reasons to further delay the release. These might include: deciding to focus on using the new model to distill the next generation of smaller, cheaper models (see the sketch at the end of this comment) and wanting to release them all together as a set; waiting for a particularly dramatic or appropriate time to release in order to maximize expected public impact; wanting to scale, test, and robustify their inference pipeline to make sure they can handle the anticipated release-day surge; or wanting to check whether the model seems so good at recursive self-improvement that they need to dumb the public version down in order not to hand competitors an advantage from using it for ML research (which could include making sure the model can't replicate secret internal techniques, or even potentially poisoning the model with false information that would set competitors back).
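On the checkpoint-evaluation point above: here is a minimal sketch of what periodic mid-training evaluation might look like. Everything in it (the function names, the eval suite, the thresholds) is illustrative and made up for the example, not a description of any lab's actual pipeline.

```python
# Illustrative sketch only: periodically evaluate checkpoints during a long
# training run so problems (bad data mix, divergence, regressions) are caught
# early, while restarting is still relatively cheap. All names are invented.

import random

def train_for_steps(model_state, n_steps):
    """Stand-in for n_steps of actual training; returns updated state."""
    model_state["steps"] += n_steps
    return model_state

def run_eval_suite(model_state):
    """Stand-in for a battery of held-out evals; returns a score in [0, 1]."""
    return min(1.0, model_state["steps"] / 100_000) * random.uniform(0.9, 1.0)

def save_checkpoint(model_state):
    print(f"saved checkpoint at step {model_state['steps']}")

model_state = {"steps": 0}
best_score_so_far = 0.0  # rough ratchet: later evals should not fall far below this

for _ in range(10):  # ten segments of a long run
    model_state = train_for_steps(model_state, 10_000)
    save_checkpoint(model_state)
    score = run_eval_suite(model_state)
    print(f"step {model_state['steps']}: eval score {score:.3f}")
    if score < 0.8 * best_score_so_far:
        # Something looks wrong; pause and investigate rather than burning
        # more compute on a run that may need to be restarted anyway.
        print("eval regression detected, pausing run for investigation")
        break
    best_score_so_far = max(best_score_so_far, score)
```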
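And on distillation: the last item mentions using the new big model to distill smaller, cheaper models. For anyone unfamiliar, the standard trick is to train the small "student" model to match the big "teacher" model's softened output distribution rather than just the hard labels. A minimal PyTorch sketch of the usual KL-divergence distillation loss follows; the temperature, the loss weighting, and the toy tensors are placeholder choices, not anyone's actual recipe.

```python
# Illustrative sketch of a standard knowledge-distillation loss:
# the student is trained to match the teacher's softened output distribution.
# Shapes, temperature, and the loss weighting are placeholder choices.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between softened teacher and student
    # distributions, scaled by T^2 (the usual Hinton et al. convention).
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random tensors standing in for real model outputs.
batch, vocab = 4, 32
student_logits = torch.randn(batch, vocab, requires_grad=True)
teacher_logits = torch.randn(batch, vocab)  # from the frozen big model
labels = torch.randint(0, vocab, (batch,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(f"distillation loss: {loss.item():.3f}")
```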