From $4 billion for a 150 megawatt cluster, I get 37 gigawatts for a $1 trillion cluster, or about seven 5-gigawatt datacenters (if geographically distributed training gets solved). Future GPUs will consume more power per GPU (though a transition to liquid cooling seems likely), but each GPU's corresponding share of the datacenter might also cost more. This is only a training system (other datacenters will be built for inference), and there is more than one player in this game, so the 100 gigawatt figure seems reasonable for this scenario.
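As a quick sanity check on that scaling, here is a minimal back-of-envelope sketch in Python, assuming cluster cost scales roughly linearly with power (which is the assumption the 37 gigawatt figure rests on):

```python
# Linear scaling of the $4B / 150 MW anchor to a $1T training system
# (assumption: cost per megawatt stays roughly constant across the scale-up).
cost_per_mw = 4e9 / 150          # ≈ $26.7M per MW
total_mw = 1e12 / cost_per_mw    # ≈ 37,500 MW, i.e. ~37.5 GW
sites = total_mw / 5_000         # ≈ 7.5 sites of 5 GW each
print(f"{total_mw / 1_000:.1f} GW total, ~{sites:.1f} five-gigawatt datacenters")
```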
Current best deployed models are about 5e25 FLOPs (possibly up to 1e26 FLOPs), while the very recent 100K-H100 scale systems can train models of about 5e26 FLOPs in a few months. Building datacenters at the 1 gigawatt scale already seems to be in progress, and models from these will plausibly start arriving in 2026. If we assume B200s, that's enough to 15x the FLOP/s compared to 100K H100s, giving 7e27 FLOPs in a few months, which corresponds to 5 trillion active parameter models (at 50 tokens/parameter).
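The parameter count follows from the standard dense-transformer estimate C ≈ 6·N·D, with D set by the tokens-per-parameter ratio; a minimal sketch under that rule of thumb:

```python
# Training compute for a dense transformer: C ≈ 6 * N * D FLOPs, with
# D = tokens_per_param * N tokens. Solving for N gives
# N ≈ sqrt(C / (6 * tokens_per_param)). A rule-of-thumb estimate, not a precise model.
def active_params(compute_flops: float, tokens_per_param: float) -> float:
    return (compute_flops / (6 * tokens_per_param)) ** 0.5

print(f"{active_params(7e27, 50):.1e}")  # ≈ 4.8e12, i.e. ~5 trillion active parameters
```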
The 5 gigawatt clusters seem more speculative for now, though o1-like post-training promises to attract sufficient investment once it's demonstrated on top of 5e26+ FLOPs base models next year. That gets us to 5e28 FLOPs (assuming a 30% FLOP/joule improvement over B200s). And then 35 gigawatts gives 3e29 FLOPs, which might be 30 trillion active parameter models (at 60 tokens/parameter). This is 4 OOMs above the rumored 2e25 FLOPs of the original GPT-4.
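Using the same C ≈ 6·N·D rule of thumb, the larger steps and the GPT-4 comparison work out as sketched below (the 2e25 FLOPs GPT-4 number is the rumored figure from the text, not a confirmed one):

```python
import math

# 5 GW step: scale the 1 GW figure (7e27 FLOPs) by 5x power and a 30% FLOP/joule gain.
flops_5gw = 7e27 * 5 * 1.3               # ≈ 4.6e28, rounded to 5e28 in the text
# 35 GW step at 60 tokens/parameter, same C ≈ 6 * N * D rule of thumb as above.
n_params = (3e29 / (6 * 60)) ** 0.5      # ≈ 2.9e13, ~30 trillion active parameters
ooms = math.log10(3e29 / 2e25)           # ≈ 4.2, i.e. ~4 OOMs over rumored GPT-4 compute
print(f"{flops_5gw:.1e} FLOPs, {n_params:.1e} params, {ooms:.1f} OOMs")
```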
If each step of this process takes 18-24 months, and we have currently just cleared 150 megawatts, there are 3 more steps to get $1 trillion training systems built, which lands around 2029-2030. If o1-like post-training works very well on top of larger-scale base models and starts really automating jobs, the impossible challenge of building these giant training systems this fast will be confronted by the impossible pressure of that success.
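A minimal sketch of the timeline arithmetic, assuming the ~150 megawatt clusters were cleared around 2024 (the start year is my assumption, inferred from the 2026 arrival of 1 gigawatt models above):

```python
# Three remaining scale-up steps (1 GW, 5 GW, ~35 GW) at 18-24 months per step,
# counted from ~2024 when the ~150 MW clusters came online (start year is an assumption).
for months_per_step in (18, 24):
    print(2024 + 3 * months_per_step / 12)   # 2028.5 and 2030.0, i.e. roughly 2029-2030
```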