1. Are the historically leading AI organizations—OpenAI, Anthropic, and Google—holding back their best models, making it appear as though there’s more parity in the market than there actually is?
4. Is it plausible that once LLMs were validated and the core idea spread, it became surprisingly simple to build, allowing any company to quickly reach the frontier?
I’ll try to address just these two points with what I know, which is limited.
OpenAI, Anthropic, Google/Google DeepMind, Meta, AWS, and NVIDIA are the main current holders of AI compute hardware. Of these, the majority of frontier engineering talent (and expressed intent to race for AGI) seems to be concentrated in OpenAI, Anthropic, and Google DeepMind. I think of these as a group I call ‘the Top 3’.
So far as the public knows, the Top 3 are working hard on preparing the next generation of LLMs (now all multimodal, so LLM is a bit of a misnomer). Preparing a next generation takes time and effort from the researchers, plus engineering support and training time on the large clusters. We are roughly halfway through the expected interval between full-step versions (e.g. GPT-4 to GPT-5). Of these companies, my best guess is that OpenAI has a slight lead, and thus will probably deploy their next gen (i.e. GPT-5) before the other companies deploy theirs (Claude 4 Opus, Google Gemini 2 Ultra). The time lag may be a couple of weeks or as much as six months. Hard to say for sure. Google has a lot of resources, and so might slightly beat OpenAI.
In any case, the big differentiator between the Top 3 and the Rest is the massive amount of hardware. GPT-4 level was something that could be trained on a wide variety of rentable resources. The scaling trend suggests that the next level up, GPT-5, will require more resources than the Rest are expected to be able to muster this year. Since hardware is advancing, some members of the Rest may acquire GPT-5 level resources late next year (2025) or early 2026. This means they’ll be quite a few months behind. As for the resources implied for GPT-6 level, if scaling trends and costs continue, it seems unlikely that most of the Rest will be able to afford to scale that large (even acting late, and with rented resources), with the exceptions of Meta and possibly xAI.
Current opinion, which I agree with, puts Claude 3.5 Sonnet at about a 4.25-4.3 level compared to GPT-4. All of the others (including GPT-4o) fall somewhere between level 4 and level 4.3.
Nobody has a level 5 yet because even the leaders haven’t had the time to create and deploy one!
As for “secret sauce” ideas, there is no public indication so far that any secret knowledge is a hard blocker. It does seem like there are a fair number of small technical secrets which improve compute efficiency or capabilities in minor ways, but these can all be compensated for by spending more on compute or coming up with alternate approaches. There is a huge amount of ML research being published every month now because the field has gotten so lucrative and trendy. The newest public research isn’t yet present in the deployed models because the models were trained and deployed before the new research was published!
This makes for a strange dynamic: the companies that lag slightly behind in getting their large training runs started get the advantage of a later cut-off date for incorporating public research. I think that getting to incorporate more of the latest research is a significant part of the explanation for why the GPT-4-sized models of late movers have slightly surpassed GPT-4.
This doesn’t imply that OpenAI is losing the race, or that they don’t have valuable technical secrets. The Top 3 are still in the lead, so far as I can foresee; they are just in the ‘hidden progress’ phase which comes between model generations. Because of this, we can’t know their relative standing for certain. Presumably, even they don’t know, since they don’t have the details on the secret tech that their competitors are putting into their next generation. We will need to wait and see how the next generation of the Top 3’s models compare to each other.
It’s unclear whether going beyond GPT-5 will be crucial; at that point researchers might become more relevant than compute again. GPT-4 level models (especially the newer ones) have the capability to understand complicated non-specialized text (now I can be certain some of my more obscure comments are Objectively Understandable), so GPT-5 level models will understand very robustly. If this is sufficient signal to get RL-like things off the ground (automating most labeling with superhuman quality, usefully scaling post-training to the level of pre-training), more scale won’t necessarily help on the currently-somewhat-routine pre-training side.
I think a little more explanation is required on why there isn’t already a model with 5-10x* more compute than GPT-4 (which would be “4.5 level” given that GPT version numbers have historically gone up by 1 for every two OOMs, though I think the model literally called GPT-5 will only be a roughly 10x scale-up).
You’d need around 100,000 H100s (or maybe somewhat fewer; Llama 3.1 was 2x GPT-4 and trained using 16,000 H100s) to train a model at 10x GPT-4. That much hardware has been available to the biggest hyperscalers since sometime last year. Naively it might take ~9 months from taking delivery of chips to releasing a model (perhaps 3 months to set up the cluster, 3 months for pre-training, 3 months for post-training, evaluations, etc.). But most likely the engineering challenges of building a cluster that big (which is unprecedented), and perhaps high demand for inference, have prevented them from concentrating that much compute into one training run in time to release a model by now.
*I’m not totally sure the 5x threshold (1e26 FLOP) hasn’t been breached but most people think it hasn’t.
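As a rough sanity check on the chip counts above, here is a back-of-the-envelope sketch in Python. The inputs are the figures assumed in this thread (GPT-4 at ~2e25 FLOPs, H100 dense BF16 at ~1e15 FLOP/s, ~40% utilization), and the version-number rule (+1 GPT version per two OOMs of compute) is just the heuristic mentioned above, not anything official.

```python
import math

# Assumptions taken from the discussion above (not official figures).
GPT4_FLOPS = 2e25            # rumored GPT-4 training compute
H100_PEAK = 1e15             # dense BF16 FLOP/s per H100
MFU = 0.40                   # assumed model FLOPs utilization
SECONDS_PER_MONTH = 30 * 24 * 3600

def h100s_needed(multiple_of_gpt4: float, months: float) -> float:
    """H100s needed to reach `multiple_of_gpt4` x GPT-4 compute in `months` of training."""
    target = multiple_of_gpt4 * GPT4_FLOPS
    flops_per_chip = H100_PEAK * MFU * months * SECONDS_PER_MONTH
    return target / flops_per_chip

def gpt_version(flops: float) -> float:
    """Heuristic from the comment above: +1 GPT version per two orders of magnitude of compute."""
    return 4 + 0.5 * math.log10(flops / GPT4_FLOPS)

print(round(h100s_needed(10, months=3)))   # ~64,000 H100s for a 3-month run at 40% MFU
print(gpt_version(10 * GPT4_FLOPS))        # 4.5
```

With lower utilization at that scale, or a shorter pre-training window, the required count drifts toward the ~100,000 figure above.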
GPT-4 (Mar 2023 version) is rumored to have been trained on 25K A100s for 2e25 FLOPs, and Gemini 1.0 Ultra on TPUv4s (this detail is in the report) for 1e26 FLOPs. In BF16, A100s give 300 teraFLOP/s, TPUv4s 270 teraFLOP/s, and H100s 1000 teraFLOP/s (marketing materials say 2000 teraFLOP/s, but that’s for sparse computation that isn’t relevant for training). So H100s have a 3x advantage over the hardware that trained GPT-4 and Gemini 1.0 Ultra. Llama-3-405B was trained on 16K H100s for about 2 months, getting 4e25 BF16 FLOPs at 40% compute utilization.
With 100K H100s, 1 month at 30% utilization gets you 8e25 FLOPs. OpenAI might have obtained this kind of training compute in May 2024, and xAI might get it at the end of 2024. AWS announced access to clusters with 20K H100s back in July 2023, which is 2e25 FLOPs a month at 40% utilization.
So assuming AWS’s offer was real in the sense of allowing a single model to be trained on a whole 20K H100s cluster, and was sufficiently liquid, then for a year now 6 months of training could have yielded a 1.2e26 FLOPs model, which is 6x GPT-4, 3x Llama-3-405B, or on par with Gemini 1.0 Ultra. Much more than that wasn’t yet possible without running multiple such clusters in parallel, using geographically distributed training with low communication between clusters, something like DiLoCo. Now that 100K H100s clusters are coming online, 6 months of training will give about 5e26 FLOPs (assuming 30% utilization and that FP8 still can’t be made to work for training models at this scale).
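To make that cluster arithmetic concrete, here is the same calculation written out; the chip counts, utilization figures, and durations are the assumptions used in this comment, not reported numbers.

```python
# Training FLOPs from cluster size, utilization, and duration
# (all inputs are the assumptions stated in the comment above).
SECONDS_PER_MONTH = 30 * 24 * 3600
H100_PEAK = 1e15  # dense BF16 FLOP/s

def training_flops(num_chips: int, mfu: float, months: float) -> float:
    return num_chips * H100_PEAK * mfu * months * SECONDS_PER_MONTH

print(f"{training_flops(16_000, 0.40, 2):.1e}")    # Llama-3-405B setup: ~3.3e25 (rounded to ~4e25 above)
print(f"{training_flops(20_000, 0.40, 6):.1e}")    # 20K H100s for 6 months: ~1.2e26
print(f"{training_flops(100_000, 0.30, 1):.1e}")   # 100K H100s for 1 month: ~8e25
print(f"{training_flops(100_000, 0.30, 6):.1e}")   # 100K H100s for 6 months: ~5e26
```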
Do you have a citation for the claim that Gemini 1.0 Ultra trained for 1e26 FLOPs? I’ve searched all around but can’t find any information on its compute cost.
I originally saw the estimate from EpochAI, which I think was either 8e25 FLOPs or 1e26 FLOPs, but I’m either misremembering or they changed the estimate, since currently they list 5e25 FLOPs (background info for a Metaculus question claims the Epoch estimate was 9e25 FLOPs in Feb 2024). In Jun 2024, SemiAnalysis posted a plot with a dot for Gemini Ultra (very beginning of this post) where it’s placed at 7e25 FLOPs (they also slightly overestimate Llama-3-405B at 5e25 FLOPs, which hadn’t yet been released at the time).
The current notes for the EpochAI estimate are linked from the model database csv file:

This number is an estimate based on limited evidence. In particular, we combine information about the performance of Gemini Ultra on various benchmarks compared to other models, and guesstimates about the hardware setup used for training to arrive at our estimate. Our reasoning and calculations are detailed in this Colab notebook.
https://colab.research.google.com/drive/1sfG91UfiYpEYnj_xB5YRy07T5dv-9O_c
Among other clues, the Colab notebook cites the Gemini 1.0 report on the use of TPUv4 in pods of 4096 across multiple datacenters for Gemini Ultra, relays a SemiAnalysis claim that Gemini Ultra could have been trained on 7+7 pods (which is 57K TPUs), and cites an article from The Information (paywalled):
Unlike OpenAI, which relied on Microsoft’s servers, Google operated its own data centers. It had even built its own specialized AI chip, the tensor processing unit, to run its software more efficiently. And it had amassed a staggering number of those chips for the Gemini effort: 77,000 of the fourth-generation TPU, code-named Pufferfish.
One TPUv4 offers 275e12 FLOP/s, so at 40% MFU this gives 1.6e25 FLOPs a month by the SemiAnalysis estimate of the number of pods, and 2.2e25 FLOPs a month by The Information’s claim about the number of TPUs.
They arrive at 6e25 FLOPs as the point estimate from hardware considerations. The training duration range is listed as 3-6 months before the code, but it’s actually 1-6 months in the code, so one of these is a bug. If we put 3-6 months into the code, their point estimate becomes 1e26 FLOPs. They also assume an MFU of 40-60%, which seems too high to me.
If the claim of 7+7 pods from SemiAnalysis is combined with the 7e25 FLOPs estimate from the SemiAnalysis plot, this suggests a training time of about 4 months. At that duration, but with The Information’s TPU count, we get 9e25 FLOPs. So after considering Epoch’s clues, I’m settling on 8e25 FLOPs as my own point estimate.
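For reference, here is that arithmetic redone with the figures quoted above (TPUv4 at 275e12 FLOP/s, 40% MFU, 7+7 pods of 4096 TPUs per SemiAnalysis versus 77K TPUs per The Information):

```python
# Recomputing the Gemini 1.0 Ultra compute estimates from the clues above.
SECONDS_PER_MONTH = 30 * 24 * 3600
TPU_V4_PEAK = 275e12  # FLOP/s
MFU = 0.40

def flops_per_month(num_tpus: int) -> float:
    return num_tpus * TPU_V4_PEAK * MFU * SECONDS_PER_MONTH

semianalysis = flops_per_month(14 * 4096)  # 7+7 pods of 4096 TPUs: ~1.6e25 per month
information = flops_per_month(77_000)      # The Information's chip count: ~2.2e25 per month

months = 7e25 / semianalysis               # duration implied by the SemiAnalysis plot: ~4.3 months
print(f"{months:.1f}")
print(f"{information * months:.1e}")       # same duration with 77K TPUs: ~9e25 FLOPs
```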
Yes, good point Josh. If the biggest labs had been pushing as fast as possible, they could have had a next-generation model out by now. I don’t have a definite answer for why they don’t, but I have some guesses. It could be a combination of any of these:
- Keeping up with inference demand, as Josh mentioned.
- Wanting to focus on things other than getting the next big model out ASAP: multimodality (e.g. GPT-4o), better versions of cheaper smaller models (e.g. Sonnet 3.5, Gemini Flash), and non-capabilities work like safety or watermarking.
- Choosing to put more time and effort into improving the data/code/training process which will be used for the next large model run. Potentially including: smaller-scale experiments to test ideas, cleaning data, improving synthetic data generation (Strawberry?), gathering new data to cover specific weak spots (perhaps by paying people to create it), and developing and testing better engineering infrastructure to support larger runs.
- Wanting to spend extra time evaluating the performance of checkpoints partway through training to make sure everything is working as expected. Larger scale means mistakes are much more costly. Mistakes caught early in the training process are less costly overall.
- Wanting to spend more time and effort evaluating the final product. There were several months where GPT-4 existed internally and got tested in a bunch of different ways. Nathan Labenz tells interesting stories of his time as a pre-release tester. Hopefully, with the new larger generation of models the companies will spend even more time and effort evaluating the new capabilities. If they scaled up their evaluation time from 6-8 months to 12-18 months, then we’d expect that much additional delay. We would only see a new next-gen model publicly right now if they had started on it ASAP and then completely skipped the safety testing. I really hope no companies choose to skip safety testing!
- If safety and quality testing is done (as I expect it will be), then flaws found could require additional time and effort to correct. I would expect multiple rounds of test—fine-tune—test—fine-tune before the final product is deemed suitable for release.
- Even after the product is deemed ready, there may be reasons for further delaying the release. These might include: deciding to focus on using the new model to distill the next generation of smaller, cheaper models and wanting to release all of them together as a set; waiting for a particularly dramatic or appropriate time to release in order to maximize expected public impact; wanting to scale/test/robustify their inference pipeline to make sure they’ll be able to handle the anticipated release-day surge; or wanting to check whether the model seems so good at recursive self-improvement that they need to dumb the public version down in order not to hand their competitors an advantage from using it for ML research (this could include making sure the model can’t replicate secret internal techniques, or even potentially poisoning the model with false information which would set competitors back).