Edit (20 Jul): These estimates erroneously use the sparse FP8 tensor performance for H100s (4 petaFLOP/s), which is 2 times higher than the far more relevant dense FP8 tensor performance (2 petaFLOP/s). But with a Blackwell GPU, the relevant dense FP8 performance is 5 petaFLOP/s, which is close to 4 petaFLOP/s, and the cost and power per GPU within a rack are also similar. So the estimates work out approximately unchanged when reading “Blackwell GPU” instead of “H100”.
Thanks! I do wonder if he might not mean $1 billion total cost (e.g. to buy the hardware); because he also claims a $10 billion run might start in 2025, which seems quite surprising?
The $100 million figure is used in the same sentence for the cost of currently deployed models. The original GPT-4 was probably trained on A100s in BF16 (A100s can’t do FP8 any faster), which is 6e14 FLOP/s, 7 times less than the 4e15 FLOP/s in FP8 from an H100 (there is no change in the quality of trained models when going from BF16 to FP8, as long as training remains stable). With A100s in BF16 at 30% utilization for 150 days, you need 9K A100s to get 2e25 FLOPs. Assuming $30K per A100 together with associated infrastructure, the cluster would cost $250 million, but again assuming $2 per hour, the time would only cost $60 million. That was in 2022, with deployment in early 2023. I expect recent models to cost at least somewhat more, so for early 2024 frontier models $100 million would be solidly the cost of time, not the cost of infrastructure.
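For anyone who wants to check the GPT-4 arithmetic, here is a rough sketch in Python. All inputs are the figures from the paragraph above; the function and variable names are made up for illustration, and the outputs are only meant to be read to one significant figure.

```python
# Back-of-envelope check of the original GPT-4 estimate.
# Inputs are the comment's own rough figures, not measured values.
SECONDS_PER_DAY = 86_400
HOURS_PER_DAY = 24

def gpus_needed(total_flops, peak_flops_per_s, utilization, days):
    """GPUs required to deliver total_flops in the given number of days."""
    flops_per_gpu = peak_flops_per_s * utilization * days * SECONDS_PER_DAY
    return total_flops / flops_per_gpu

a100s = gpus_needed(total_flops=2e25, peak_flops_per_s=6e14,
                    utilization=0.30, days=150)

# Cost of buying the cluster vs. cost of renting the time.
hardware_cost = a100s * 30_000                  # $30K per A100 incl. infrastructure
rental_cost = a100s * 150 * HOURS_PER_DAY * 2   # $2 per GPU-hour for 150 days

print(f"{a100s:,.0f} A100s")
print(f"hardware: ${hardware_cost / 1e6:,.0f}M, rental: ${rental_cost / 1e6:,.0f}M")
```

The unrounded GPU count comes out a bit under 9K, which is why the hardware and rental totals land near $250 million and $60 million rather than exactly on 9K times the unit prices.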
The $1 billion for cost of time suggests an ability to train on multiple clusters, and the Gemini 1.0 report basically says they did just that. So the $10 billion figure needs to be interpreted as being about the scale of many clusters taken together, not individual clusters. The estimate for training on H100s for 200 days says you need 150 megawatts for $1 billion in training time, or 1.5 gigawatts for $10 billion in training time. And each hyperscaler has datacenters that consume 2-3 gigawatts in total (they are much smaller individually), with current plans to double. So at least the OOMs match the $10 billion claim interpreted as cost of training time.
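The power scaling can be sketched the same way. The 1.5 kW all-in power per GPU is an assumption on my part: it is just what 150 megawatts for 100K H100s implies, not a number quoted from a datasheet. The function name is also hypothetical.

```python
# Power needed for a run of a given rental cost, at fixed duration.
# kw_per_gpu = 1.5 is an ASSUMPTION implied by "150 MW for 100K H100s".
HOURS_PER_DAY = 24

def power_for_budget(budget_usd, usd_per_gpu_hour=2.0, days=200, kw_per_gpu=1.5):
    """Return (GPU count, megawatts) for a run that costs budget_usd in GPU time."""
    gpus = budget_usd / (usd_per_gpu_hour * HOURS_PER_DAY * days)
    megawatts = gpus * kw_per_gpu / 1000
    return gpus, megawatts

gpus_1b, mw_1b = power_for_budget(1e9)
gpus_10b, mw_10b = power_for_budget(10e9)

print(f"$1B:  {gpus_1b:,.0f} GPUs, {mw_1b:,.0f} MW")
print(f"$10B: {gpus_10b:,.0f} GPUs, {mw_10b:,.0f} MW")
```

At fixed duration and price, power is simply proportional to budget, so 10 times the money means 10 times the megawatts.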
Dario Amodei claims there are current $1 billion training runs. At $2/hour with H100s, this means 2e12 H100-seconds. Assuming 30% utilization and 4e15 FP8 FLOP/s, this is 2e27 FLOPs, 2 OOMs above estimates for the original GPT-4. This corresponds to 200 days with 100K H100s (and 150 megawatts). 100K H100 clusters don’t seem to have been built yet; the largest publicly known ones are Meta’s two clusters with 24K H100s each. But it might be possible to train on multiple clusters if the inter-cluster network is good enough.
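The conversion from dollars to FLOPs above can be reproduced with a few lines of Python. This is only a sketch using the same rough inputs ($2 per GPU-hour, 4e15 FLOP/s, 30% utilization); the helper name is made up for illustration.

```python
# Convert a training budget into GPU-seconds and total FLOPs.
# Inputs are the rough figures from the estimate, not measured values.
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400

def budget_to_compute(budget_usd, usd_per_gpu_hour, peak_flops_per_s, utilization):
    """Return (GPU-seconds, total FLOPs) that budget_usd buys."""
    gpu_seconds = budget_usd / usd_per_gpu_hour * SECONDS_PER_HOUR
    total_flops = gpu_seconds * peak_flops_per_s * utilization
    return gpu_seconds, total_flops

gpu_seconds, total_flops = budget_to_compute(
    budget_usd=1e9, usd_per_gpu_hour=2.0,
    peak_flops_per_s=4e15, utilization=0.30)

# How long the same GPU-seconds take on a 100K-GPU cluster.
days_on_100k = gpu_seconds / 100_000 / SECONDS_PER_DAY

print(f"{gpu_seconds:.1e} GPU-seconds, {total_flops:.1e} FLOPs, "
      f"{days_on_100k:.0f} days on 100K GPUs")
```

Changing `peak_flops_per_s` is the easiest way to see how sensitive the FLOPs total is to the choice of per-GPU throughput.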