GPT-4 (Mar 2023 version) is rumored to have been trained on 25K A100s for 2e25 FLOPs, and Gemini 1.0 Ultra on TPUv4s (this detail is in the report) for 1e26 FLOPs. In BF16, A100s give about 312 teraFLOP/s, TPUv4s 275 teraFLOP/s, and H100s about 1000 teraFLOP/s (marketing materials say 2000 teraFLOP/s, but that’s for sparse computation that isn’t relevant for training). So H100s have roughly a 3x advantage over the hardware that trained GPT-4 and Gemini 1.0 Ultra. Llama-3-405b was trained on 16K H100s for about 2 months, getting 4e25 BF16 FLOPs at 40% compute utilization.
With 100K H100s, 1 month at 30% utilization gets you 8e25 FLOPs. OpenAI might have obtained this kind of training compute in May 2024, and xAI might get it at the end of 2024. AWS announced access to clusters with 20K H100s back in July 2023, which is 2e25 FLOPs a month at 40% utilization.
So assuming AWS’s offer was real for the purpose of training a single model on the whole 20K H100s cluster and was sufficiently liquid, for a year now 6 months of training could have yielded a 1.2e26 FLOPs model, which is 6x GPT-4, 3x Llama-3-405b, or on par with Gemini 1.0 Ultra. But much more than that wasn’t yet possible, not without running multiple such clusters in parallel using geographically distributed training with low communication between clusters, something like DiLoCo. Now that 100K H100s clusters are coming online, 6 months of training will give about 5e26 FLOPs (assuming 30% utilization, and that FP8 still can’t be made to work for training models at this scale).
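For convenience, here is a minimal sketch of the arithmetic behind these figures (chip count × peak throughput × utilization × training time); the peak-throughput and utilization numbers are the assumptions stated above, not measurements:

```python
def training_flops(num_chips, peak_flops_per_chip, utilization, months):
    """Total training FLOPs: chips x peak FLOP/s x MFU x wall-clock seconds."""
    seconds_per_month = 30 * 24 * 3600  # ~2.6e6 s
    return num_chips * peak_flops_per_chip * utilization * months * seconds_per_month

# Llama-3-405b: 16K H100s (~1e15 BF16 FLOP/s each), ~2 months at 40% MFU
print(f"{training_flops(16_000, 1e15, 0.40, 2):.1e}")    # ~3.3e25, i.e. about 4e25
# AWS 20K H100 cluster at 40% MFU: per month, and over 6 months
print(f"{training_flops(20_000, 1e15, 0.40, 1):.1e}")    # ~2e25 per month
print(f"{training_flops(20_000, 1e15, 0.40, 6):.1e}")    # ~1.2e26
# 100K H100 cluster at 30% MFU: per month, and over 6 months
print(f"{training_flops(100_000, 1e15, 0.30, 1):.1e}")   # ~8e25 per month
print(f"{training_flops(100_000, 1e15, 0.30, 6):.1e}")   # ~4.7e26, i.e. about 5e26
```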
Do you have a citation for the claim that Gemini 1.0 Ultra was trained for 1e26 FLOPs? I’ve searched all around but can’t find any information on its compute cost.
I originally saw the estimate from EpochAI, which I think was either 8e25 FLOPs or 1e26 FLOPs, but I’m either misremembering or they changed the estimate, since they currently list 5e25 FLOPs (background info for a Metaculus question claims the Epoch estimate was 9e25 FLOPs in Feb 2024). In Jun 2024, SemiAnalysis posted a plot with a dot for Gemini Ultra (at the very beginning of that post) that places it at 7e25 FLOPs (they also slightly overestimate Llama-3-405B at 5e25 FLOPs, which wasn’t yet released at the time).
The current notes for the EpochAI estimate are linked from their model database csv file:
This number is an estimate based on limited evidence. In particular, we combine information about the performance of Gemini Ultra on various benchmarks compared to other models, and guesstimates about the hardware setup used for training to arrive at our estimate. Our reasoning and calculations are detailed in this Colab notebook.
https://colab.research.google.com/drive/1sfG91UfiYpEYnj_xB5YRy07T5dv-9O_c
Among other clues, the Colab notebook cites the Gemini 1.0 report’s mention of TPUv4 pods of 4096 chips across multiple datacenters being used for Gemini Ultra, notes a SemiAnalysis claim that Gemini Ultra could have been trained on 7+7 pods (about 57K TPUs), and cites an article from The Information (paywalled):
Unlike OpenAI, which relied on Microsoft’s servers, Google operated its own data centers. It had even built its own specialized AI chip, the tensor processing unit, to run its software more efficiently. And it had amassed a staggering number of those chips for the Gemini effort: 77,000 of the fourth-generation TPU, code-named Pufferfish.
One TPUv4 offers 275e12 FLOP/s, so at 40% MFU this gives 1.6e25 FLOPs a month by the SemiAnalysis estimate of the number of pods, and 2.2e25 FLOPs a month by The Information’s claim about the number of TPUs.
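A quick check of those monthly rates, under the same assumptions (4096 chips per pod, 275e12 FLOP/s per TPUv4, 40% MFU):

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # ~2.6e6 s
TPU_V4_PEAK = 275e12                 # BF16 FLOP/s per TPUv4
MFU = 0.40

def monthly_flops(num_chips):
    """FLOPs delivered per month by a TPUv4 fleet at 40% MFU."""
    return num_chips * TPU_V4_PEAK * MFU * SECONDS_PER_MONTH

print(f"{monthly_flops(14 * 4096):.1e}")  # 7+7 pods (SemiAnalysis): ~1.6e25 FLOPs/month
print(f"{monthly_flops(77_000):.1e}")     # 77K TPUs (The Information): ~2.2e25 FLOPs/month
```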
They arrive at 6e25 FLOPs as the point estimate from hardware considerations. The training duration range is listed as 3-6 months in the text preceding the code, but it’s actually 1-6 months in the code, so one of these is a bug. If we put 3-6 months in the code, their point estimate becomes 1e26 FLOPs. They also assume an MFU of 40-60%, which seems too high to me.
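To see why moving the range from 1-6 to 3-6 months shifts 6e25 to about 1e26: if the point estimate scales with the geometric mean of the duration range (my assumption about how the notebook aggregates, e.g. as the median of a loguniform sample would; not something it states), then:

```python
from math import sqrt

# Assumption (mine, not the notebook's): the point estimate scales with the
# geometric mean of the training-duration range, as the median of a loguniform
# sample over that range would.
base_estimate = 6e25                   # their point estimate with a 1-6 month range
scale = sqrt(3 * 6) / sqrt(1 * 6)      # ratio of geometric means, ~1.73
print(f"{base_estimate * scale:.1e}")  # ~1.0e26 FLOPs
```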
If the 7+7 pods claim attributed to SemiAnalysis is combined with the 7e25 FLOPs estimate from the SemiAnalysis plot, this suggests a training time of about 4 months. At that duration, but with The Information’s TPU count, we get 9e25 FLOPs. So after considering Epoch’s clues, I’m settling on 8e25 FLOPs as my own point estimate.
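The back-of-the-envelope version of that last step, reusing the monthly rates from above (my arithmetic, not Epoch’s or SemiAnalysis’s):

```python
# Monthly rates at 40% MFU, from the TPUv4 calculation above
semianalysis_rate = 1.6e25    # FLOPs/month with 7+7 pods (~57K TPUv4s)
information_rate = 2.2e25     # FLOPs/month with 77K TPUv4s

months = 7e25 / semianalysis_rate          # duration implied by the SemiAnalysis plot
print(f"{months:.1f}")                     # ~4.4, i.e. about 4 months
print(f"{months * information_rate:.1e}")  # ~9.6e25, i.e. roughly 9e25 FLOPs
```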