I think a little more explanation is required on why there isn’t already a model with 5-10x* more compute than GPT-4 (which would be “4.5 level” given that GPT version numbers have historically gone up by 1 for every two OOMs, though I think the model literally called GPT-5 will only be a roughly 10x scale-up).
You’d need around 100,000 H100s (or maybe somewhat fewer; Llama 3.1 was 2x GPT-4 and was trained on 16,000 H100s) to train a model at 10x GPT-4. This much compute has been available to the biggest hyperscalers since sometime last year. Naively it might take ~9 months to go from taking delivery of chips to releasing a model (perhaps 3 months to set up the cluster, 3 months for pre-training, 3 months of post-training, evaluations, etc.). But most likely the engineering challenges of building a cluster that big, which is unprecedented, and perhaps high demand for inference, have prevented labs from concentrating that much compute into one training run in time to release a model by now.
*I’m not totally sure the 5x threshold (1e26 FLOP) hasn’t been breached but most people think it hasn’t.
GPT-4 (Mar 2023 version) is rumored to have been trained on 25K A100s for 2e25 FLOPs, and Gemini 1.0 Ultra on TPUv4s (this detail is in the report) for 1e26 FLOPs. In BF16, A100s give about 300 teraFLOP/s, TPUv4s 275 teraFLOP/s, and H100s 1000 teraFLOP/s (marketing materials say 2000 teraFLOP/s, but that’s for sparse computation that isn’t relevant for training). So H100s have about a 3x advantage over the hardware that trained GPT-4 and Gemini 1.0 Ultra. Llama-3-405B was trained on 16K H100s for about 2 months, getting 4e25 BF16 FLOPs at 40% compute utilization.
With 100K H100s, 1 month at 30% utilization gets you 8e25 FLOPs. OpenAI might have obtained this kind of training compute in May 2024, and xAI might get it at the end of 2024. AWS announced access to clusters with 20K H100s back in July 2023, which is 2e25 FLOPs a month at 40% utilization.
So assuming AWS’s offer was real for the purpose of training a single model on the whole 20K-H100 cluster, and was sufficiently liquid, then for a year now a 6-month training run could have yielded a 1.2e26-FLOP model: 6x GPT-4, 3x Llama-3-405B, or on par with Gemini 1.0 Ultra. But much more than that wasn’t yet possible without running multiple such clusters in parallel using geographically distributed training with low communication between clusters, something like DiLoCo. Now that 100K-H100 clusters are coming online, 6 months of training will give about 5e26 FLOPs (assuming 30% utilization, and that FP8 still can’t be made to work for training models at this scale).
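The arithmetic in the last few paragraphs all follows the same pattern: chips × peak rate × utilization × wall-clock time. A minimal sketch of that calculation, using the assumed rates and durations above (not measured values):

```python
# Back-of-envelope training-compute calculator for the figures above.
# Peak rates are dense BF16 FLOP/s; the MFU and duration values are
# the assumptions stated in the text, not measurements.

SECONDS_PER_MONTH = 30 * 24 * 3600

def training_flops(chips, peak_flops_per_chip, mfu, months):
    """Total FLOPs = chips * peak rate * utilization * wall-clock time."""
    return chips * peak_flops_per_chip * mfu * months * SECONDS_PER_MONTH

# Llama-3-405B: 16K H100s at 40% MFU for ~2 months
llama3 = training_flops(16_000, 1e15, 0.40, 2)    # ~3.3e25, consistent with ~4e25 at ~2.4 months

# Hypothetical single run on the 20K-H100 AWS cluster, 6 months at 40% MFU
aws_run = training_flops(20_000, 1e15, 0.40, 6)   # ~1.2e26, i.e. ~6x GPT-4's 2e25

# A 100K-H100 cluster, 6 months at 30% MFU
big_run = training_flops(100_000, 1e15, 0.30, 6)  # ~4.7e26, i.e. roughly 5e26
```

The month length and MFU figures dominate the rounding, which is why the quoted numbers are only good to one significant figure.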
Do you have a citation for the claim that Gemini 1.0 Ultra was trained for 1e26 FLOPs? I’ve searched all around but can’t find any information on its compute cost.
I originally saw the estimate from EpochAI, which I think was either 8e25 FLOPs or 1e26 FLOPs, but I’m either misremembering or they changed the estimate, since currently they list 5e25 FLOPs (background info for a Metaculus question claims the Epoch estimate was 9e25 FLOPs in Feb 2024). In Jun 2024, SemiAnalysis posted a plot with a dot for Gemini Ultra (very beginning of this post) where it’s placed at 7e25 FLOPs (they also slightly overestimate Llama-3-405B at 5e25 FLOPs, which wasn’t yet released then).
The current notes for the EpochAI estimate are linked from the model database csv file:
This number is an estimate based on limited evidence. In particular, we combine information about the performance of Gemini Ultra on various benchmarks compared to other models, and guesstimates about the hardware setup used for training to arrive at our estimate. Our reasoning and calculations are detailed in this Colab notebook.
https://colab.research.google.com/drive/1sfG91UfiYpEYnj_xB5YRy07T5dv-9O_c
Among other clues, the Colab notebook cites the Gemini 1.0 report on the use of TPUv4s in pods of 4096 across multiple datacenters for Gemini Ultra, relays SemiAnalysis’s claim that Gemini Ultra could have been trained on 7+7 pods (which is 57K TPUs), and cites an article from The Information (paywalled):
Unlike OpenAI, which relied on Microsoft’s servers, Google operated its own data centers. It had even built its own specialized AI chip, the tensor processing unit, to run its software more efficiently. And it had amassed a staggering number of those chips for the Gemini effort: 77,000 of the fourth-generation TPU, code-named Pufferfish.
One TPUv4 offers 275e12 FLOP/s, so at 40% MFU this gives 1.6e25 FLOPs a month by SemiAnalysis estimate on number of pods and 2.2e25 FLOPs a month by The Information’s claim on number of TPUs.
They arrive at 6e25 FLOPs as the point estimate from hardware considerations. The training duration range is listed as 3-6 months in the text before the code, but it’s actually 1-6 months in the code, so one of these is a bug. If we put 3-6 months in the code, their point estimate becomes 1e26 FLOPs. They also assume an MFU of 40-60%, which seems too high to me.
If the claim of 7+7 pods from SemiAnalysis is combined with the 7e25 FLOPs estimate from the SemiAnalysis plot, this suggests a training time of about 4 months. At that duration, but with The Information’s TPU count, we get 9e25 FLOPs. So after considering Epoch’s clues, I’m settling on 8e25 FLOPs as my own point estimate.
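The two hardware-based estimates follow from the same chips × peak rate × MFU × time arithmetic. A sketch reproducing them, where the TPU counts, 40% MFU, and the ~4-month duration are the assumptions discussed above:

```python
# Reproducing the Gemini 1.0 Ultra compute estimates above.
# TPUv4 peak is 275e12 FLOP/s; 40% MFU is an assumption, not a measurement.

SECONDS_PER_MONTH = 30 * 24 * 3600
TPU_V4_PEAK = 275e12
MFU = 0.40

def flops_per_month(n_tpus):
    """Monthly training FLOPs for a given TPU count at the assumed MFU."""
    return n_tpus * TPU_V4_PEAK * MFU * SECONDS_PER_MONTH

semianalysis = flops_per_month(14 * 4096)   # 7+7 pods of 4096 → ~1.6e25/month
the_information = flops_per_month(77_000)   # 77K TPUs → ~2.2e25/month

# 7e25 FLOPs (SemiAnalysis plot) at the SemiAnalysis rate → ~4.3 months
months = 7e25 / semianalysis

# Same duration with The Information's TPU count → ~9e25 FLOPs
gemini_est = the_information * months
```

Splitting the difference between the ~7e25 and ~9e25 figures is what motivates the 8e25 point estimate.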
Yes, good point Josh. If the biggest labs had been pushing as fast as possible, they could have a next model by now.
I don’t have a definite answer to this, but I have some guesses.
It could be a combination of any of these.
Keeping up with inference demand, as Josh mentioned
Wanting to focus on things other than getting the next big model out ASAP:
multimodality (e.g. GPT-4o), better versions of cheaper smaller models (e.g. Sonnet 3.5, Gemini Flash), non-capabilities work like safety or watermarking
choosing to put more time and effort into improving the data/code/training process which will be used for the next large model run. Potentially including: smaller-scale experiments to test ideas, cleaning data, improving synthetic data generation (strawberry?), gathering new data to cover specific weak spots (perhaps by paying people to create it), developing and testing better engineering infrastructure to support larger runs
wanting to spend extra time evaluating performance of the checkpoints partway through training to make sure everything is working as expected. Larger scale means mistakes are much more costly. Mistakes caught early in the training process are less costly overall.
wanting to spend more time and effort evaluating the final product. There were several months where GPT-4 existed internally and got tested in a bunch of different ways. Nathan Labenz tells interesting stories of his time as a pre-release tester. Hopefully, with the new larger generation of models the companies will spend even more time and effort evaluating the new capabilities. If they scaled up their evaluation time from 6-8 months to 12-18 months, then we’d expect that much additional delay. We would only see a new next-gen model publicly right now if they had started on it ASAP and then completely skipped the safety testing. I really hope no companies choose to skip safety testing!
if safety and quality testing is done (as I expect it will be), then flaws found could require additional time and effort to correct. I would expect multiple rounds of test—fine-tune—test—fine-tune before the final product is deemed suitable for release.
even after the product is deemed ready, there may be reasons for further delaying the release. These might include: deciding to focus on using the new model to distill the next generation of smaller, cheaper models and wanting to release all of them together as a set; waiting for a particularly dramatic or appropriate time to release in order to maximize expected public impact; wanting to scale/test/robustify their inference pipeline to make sure they’ll be able to handle the anticipated release-day surge; or wanting to check whether the model is so good at recursive self-improvement that they need to dumb the public version down in order not to hand competitors an advantage from using it for ML research (which could include making sure the model can’t replicate secret internal techniques, or even poisoning the model with false information to set competitors back).