From $4 billion for a 150 megawatt cluster, I get 37 gigawatts for a $1 trillion cluster, or about seven 5-gigawatt datacenters (if geographically distributed training gets solved). Future GPUs will consume more power per GPU (though a transition to liquid cooling seems likely), but each GPU's corresponding share of the datacenter might also cost more. This is only a training system (other datacenters will be built for inference), and there is more than one player in this game, so the 100 gigawatt figure seems reasonable for this scenario.
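As a quick sanity check on that scaling, here is a minimal back-of-envelope sketch in Python, assuming cluster cost scales roughly linearly with power (which is the assumption the 37 gigawatt figure rests on):

```python
# Linear scaling of the $4B / 150 MW anchor to a $1T training system
# (assumption: cost per megawatt stays roughly constant across the scale-up).
cost_per_mw = 4e9 / 150          # ≈ $26.7M per MW
total_mw = 1e12 / cost_per_mw    # ≈ 37,500 MW, i.e. ~37.5 GW
sites = total_mw / 5_000         # ≈ 7.5 sites of 5 GW each
print(f"{total_mw / 1_000:.1f} GW total, ~{sites:.1f} five-gigawatt datacenters")
```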
Current best deployed models are about 5e25 FLOPs (possibly up to 1e26 FLOPs), while the very recent 100K-H100 scale systems can train models of about 5e26 FLOPs in a few months. Building datacenters at the 1 gigawatt scale already seems to be in progress, and models from these will plausibly start arriving in 2026. If we assume B200s, that's enough to 15x the FLOP/s compared to 100K H100s, giving 7e27 FLOPs in a few months, which corresponds to 5 trillion active parameter models (at 50 tokens/parameter).
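The parameter count follows from the standard dense-transformer estimate C ≈ 6·N·D, with D set by the tokens-per-parameter ratio; a minimal sketch under that rule of thumb:

```python
# Training compute for a dense transformer: C ≈ 6 * N * D FLOPs, with
# D = tokens_per_param * N tokens. Solving for N gives
# N ≈ sqrt(C / (6 * tokens_per_param)). A rule-of-thumb estimate, not a precise model.
def active_params(compute_flops: float, tokens_per_param: float) -> float:
    return (compute_flops / (6 * tokens_per_param)) ** 0.5

print(f"{active_params(7e27, 50):.1e}")  # ≈ 4.8e12, i.e. ~5 trillion active parameters
```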
The 5 gigawatt clusters seem more speculative for now, though o1-like post-training promises to attract sufficient investment once it's demonstrated on top of 5e26+ FLOPs base models next year. That gets us to 5e28 FLOPs (assuming a 30% FLOP/joule improvement over B200s). And then 35 gigawatts gives 3e29 FLOPs, which might be 30 trillion active parameter models (at 60 tokens/parameter). This is 4 OOMs above the rumored 2e25 FLOPs of the original GPT-4.
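Using the same C ≈ 6·N·D rule of thumb, the larger steps and the GPT-4 comparison work out as sketched below (the 2e25 FLOPs GPT-4 number is the rumored figure from the text, not a confirmed one):

```python
import math

# 5 GW step: scale the 1 GW figure (7e27 FLOPs) by 5x power and a 30% FLOP/joule gain.
flops_5gw = 7e27 * 5 * 1.3               # ≈ 4.6e28, rounded to 5e28 in the text
# 35 GW step at 60 tokens/parameter, same C ≈ 6 * N * D rule of thumb as above.
n_params = (3e29 / (6 * 60)) ** 0.5      # ≈ 2.9e13, ~30 trillion active parameters
ooms = math.log10(3e29 / 2e25)           # ≈ 4.2, i.e. ~4 OOMs over rumored GPT-4 compute
print(f"{flops_5gw:.1e} FLOPs, {n_params:.1e} params, {ooms:.1f} OOMs")
```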
If each step of this process takes 18-24 months, and we have currently just cleared 150 megawatts, there are 3 more steps to get $1 trillion training systems built, which lands around 2029-2030. If o1-like post-training works very well on top of larger-scale base models and starts really automating jobs, the impossible challenge of building these giant training systems this fast will be confronted by the impossible pressure of that success.
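A minimal sketch of the timeline arithmetic, assuming the ~150 megawatt clusters were cleared around 2024 (the start year is my assumption, inferred from the 2026 arrival of 1 gigawatt models above):

```python
# Three remaining scale-up steps (1 GW, 5 GW, ~35 GW) at 18-24 months per step,
# counted from ~2024 when the ~150 MW clusters came online (start year is an assumption).
for months_per_step in (18, 24):
    print(2024 + 3 * months_per_step / 12)   # 2028.5 and 2030.0, i.e. roughly 2029-2030
```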