They don’t claim that Grok 3 was trained on 200K GPUs, and from other things they say that can’t actually be the case. The first 100K H100s were brought online in early Sep 2024, and the subsequent 100K H200s took 92 days to set up, so early Dec 2024 at the earliest if they started immediately, which they didn’t necessarily. But pretraining of Grok 3 was done by Jan 2025, so there wasn’t enough time to make much use of the additional H200s.
There is also a plot where Grok 2 compute is shown slightly above that of GPT-4, so maybe 3e25 FLOPs. And Grok 3 compute is said to be either 10x or 15x that of Grok 2. The 15x figure comes from Musk, who also mentioned that Grok 2 was trained on fewer than 8K GPUs, so he may have just been talking about the number of GPUs, whereas the 10x figure came from a team member and was more plausibly about compute. This points to 3e26 FLOPs for Grok 3, which on 100K H100s at 40% utilization would take about 3 months, a plausible amount of time if everything worked almost on the first try.
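As a sanity check on that arithmetic, a minimal sketch; the per-H100 throughput of ~989 TFLOP/s dense BF16 is my assumption, while the GPU count, utilization, and FLOPs target are the figures from the estimate above:

```python
# Back-of-the-envelope check for the Grok 3 estimate above.
H100_BF16_FLOPS = 989e12   # assumed peak dense BF16 throughput per H100
N_GPUS = 100_000           # the first Colossus phase
UTILIZATION = 0.40         # assumed compute utilization during pretraining
TARGET_FLOPS = 3e26        # 10x the ~3e25 Grok 2 estimate

seconds = TARGET_FLOPS / (N_GPUS * H100_BF16_FLOPS * UTILIZATION)
print(f"{seconds / 86400:.0f} days")   # ~88 days, i.e. roughly 3 months
```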
Time needed to build a datacenter, given the funding and chips, isn’t particularly important for timelines, only for catching up to the frontier (as long as it’s 3 months vs. 6 months and not 18 months). Timelines are constrained by securing more funding for a training system, and by designing and manufacturing better chips. Another claim in that presentation was that work is starting on a further 1.2 GW GB200/GB300 datacenter, which translates to about 600K chips. This appears to be more than other LLM labs will construct this year, which might be only about 0.5 GW, except for Google[1], but then Musk didn’t give a timeline for the 1.2 GW site either. It’s only more concrete than Meta’s 2 GW site in specifying that the chips are Blackwell, so it can’t be about plans for 2027, when better chips will be available.
On a recent podcast, Jeff Dean stated more clearly that their synchronous multi-datacenter training works between metro areas (not just for very nearby datacenters), and in Dec 2024 they started general availability of 100K TPUv6e clusters. A TPUv6e has similar performance to an H100, and there are two areas being built up in 2025, each with 1 GW of Google datacenters near each other. So there’s potential for 1M H100s’ or 400K B200s’ worth of compute, or even double that if these areas or others can be connected with sufficient bandwidth.
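To make the power-to-chip-count conversions here and in the previous paragraph explicit, a small sketch; the all-in kW-per-chip figures are simply backed out from the numbers quoted above (site power divided by chip count), not taken from datasheets:

```python
# Rough power-to-chip-count conversions implied by the figures above.
def chips_for(site_gw: float, kw_per_chip: float) -> int:
    """Accelerators a site of `site_gw` gigawatts can host, assuming
    `kw_per_chip` kilowatts of all-in power (chip, host, networking,
    cooling overhead) per accelerator."""
    return int(site_gw * 1e6 / kw_per_chip)

# 1.2 GW of GB200/GB300 -> 600K chips implies ~2 kW all-in per Blackwell GPU.
print(chips_for(1.2, 2.0))   # 600_000

# 1 GW -> "1M H100s worth of compute" implies ~1 kW of all-in site power
# per H100-equivalent of compute.
print(chips_for(1.0, 1.0))   # 1_000_000
```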
Can we assume that Gemini 2.0, GPT-4o, Claude 3.5, and other models with similar performance were trained with similar amounts of compute?
For Claude 3.5, Amodei says the training run cost “a few $10M’s”, which translates to between 1e25 FLOPs (H100, $40M, $4/hour, 30% utilization, BF16) and 1e26 FLOPs (H100, $80M, $2/hour, 50% utilization, FP8); my point estimate is 4e25 FLOPs.
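A sketch of that cost-to-FLOPs conversion, with the bracketing assumptions written out; the H100 peak throughputs (~989 TFLOP/s BF16, ~1979 TFLOP/s FP8) are my assumptions, the dollar figures, rates, and utilizations are the ones listed above:

```python
# Convert a training-run budget into FLOPs under explicit assumptions about
# $/GPU-hour, utilization, and per-GPU throughput at the given precision.
def flops_from_cost(dollars, dollars_per_gpu_hour, utilization, peak_flops):
    gpu_hours = dollars / dollars_per_gpu_hour
    return gpu_hours * 3600 * peak_flops * utilization

low  = flops_from_cost(40e6, 4.0, 0.30, 989e12)   # H100 BF16 -> ~1e25 FLOPs
high = flops_from_cost(80e6, 2.0, 0.50, 1979e12)  # H100 FP8  -> ~1e26 FLOPs
print(f"{low:.1e} .. {high:.1e}")
```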
GPT-4o was trained around the same time (late 2023 to very early 2024), and given that the current OpenAI training system seems to take the form of three buildings totaling 100K H100s (the Goodyear, Arizona site), they probably had one of those buildings, about 32K H100s, which in 3 months at 40% utilization in BF16 gives 1e26 FLOPs.
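The same kind of calculation for that GPT-4o figure (again assuming ~989 TFLOP/s of peak dense BF16 per H100; chip count, duration, and utilization as above):

```python
# GPT-4o estimate: 32K H100s for ~3 months at 40% utilization in BF16.
N_GPUS, UTIL, DAYS = 32_000, 0.40, 90
flops = N_GPUS * 989e12 * UTIL * DAYS * 86400
print(f"{flops:.1e}")   # ~1.0e26 FLOPs
```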
Gemini 2.0 was released concurrently with the announcement of general availability of 100K TPUv6e clusters (the instances you can book are much smaller), so they probably have several of them, and Jeff Dean’s remarks suggest they might’ve been able to connect some of them for purposes of pretraining. Each one can contribute 3e26 FLOPs (conservatively assuming BF16). Hassabis noted on a podcast a few months back that scaling compute 10x each generation seems like a good target for fighting through the engineering challenges. Gemini 1.0 Ultra was trained on either 77K TPUv4 (according to The Information) or 14 pods of 4096 TPUv4 (according to EpochAI’s quote from SemiAnalysis), so my point estimate for Gemini 1.0 Ultra is 8e25 FLOPs.
This gives 6e26-9e26 FLOPs for Gemini 2.0 (from 2-3 100K TPUv6e clusters). But it’s unclear whether this is what went into Gemini 2.0 Pro or whether there is also an unannounced Gemini 2.0 Ultra down the line.
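A sketch of the TPU-based estimates in the last two paragraphs; the peak throughputs (~918 TFLOP/s BF16 for TPUv6e, ~275 TFLOP/s for TPUv4) plus the ~40% utilization over ~100 days are my assumptions, and only the chip counts come from the estimates above:

```python
# TPU-cluster compute estimates under assumed utilization and duration.
def flops(chips, peak, util=0.40, days=100):
    return chips * peak * util * days * 86400

print(f"{flops(100_000, 918e12):.1e}")   # one 100K TPUv6e cluster: ~3e26
print(f"{flops(77_000, 275e12):.1e}")    # 77K TPUv4: ~7e25, near the 8e25
                                         # Gemini 1.0 Ultra point estimate
print(f"{2 * flops(100_000, 918e12):.1e} .. "
      f"{3 * flops(100_000, 918e12):.1e}")
# 2-3 clusters: roughly the 6e26-9e26 range given above
```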
Thank you. Given the extreme uncertainty about the timing and impact of AGI, it’s nice to know at least something definite.
So it seems we are already at the GPT-4.5 level? Except that reasoning models have confused everything: as I understand it, an extra OOM of output compute can have roughly the same effect as an extra OOM of training compute.
By the way, you’ve analyzed the scaling of pretraining a lot, but what about inference scaling? It seems that o3 already used thousands of GPUs to solve the ARC-AGI tasks.