Frontier model training requires that you build the largest training system yourself, because no such system is already available for you to rent time on. Currently Microsoft builds these systems for OpenAI, and Amazon for Anthropic; since it's Microsoft and Amazon that own the systems, OpenAI and Anthropic don't pay for them in full. Google, xAI, and Meta build their own.
Models that are already deployed were trained with about 5e25 FLOPs, which takes about 15K H100s a few months. These training systems cost about $700 million to build. Musk announced that the Memphis cluster got 100K H100s working in Sep 2024, OpenAI reportedly got a 100K H100s cluster working in May 2024, and Zuckerberg recently said that Llama 4 will be trained on over 100K GPUs. These systems cost $4-5 billion to build, and we'll probably start seeing 5e26 FLOPs models trained on them starting this winter. OpenAI, Anthropic, and xAI each had billions invested in them, some of it in compute credits for the first two, so the orders of magnitude add up. This is just training; more goes to inference, but presumably revenue covers that part.
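The FLOPs figures above can be sanity-checked with back-of-the-envelope arithmetic. The per-GPU throughput, utilization, and training duration below are my assumptions, not figures from the text: roughly 1e15 dense BF16 FLOP/s per H100 and ~40% model FLOPs utilization (MFU).

```python
# Back-of-the-envelope check of the training-compute figures.
# Assumed (not from the text): H100 dense BF16 throughput ~1e15 FLOP/s,
# ~40% MFU. Training duration is a parameter.

H100_FLOPS = 1e15  # approximate dense BF16 FLOP/s per H100
MFU = 0.4          # assumed model FLOPs utilization

def training_flops(n_gpus: int, days: float,
                   peak_flops: float = H100_FLOPS, mfu: float = MFU) -> float:
    """Total useful training FLOPs for a cluster of n_gpus running for `days`."""
    return n_gpus * peak_flops * mfu * days * 86_400

print(f"15K H100s, 3 months:   {training_flops(15_000, 90):.1e}")   # ~5e25
print(f"100K H100s, 5 months:  {training_flops(100_000, 150):.1e}")  # ~5e26
```

At these assumed rates, 15K H100s over ~3 months land near 5e25 FLOPs, and reaching 5e26 on a 100K-H100 cluster takes roughly 4-5 months (or less with better utilization or lower precision), consistent with the "this winter" timeline.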
There are already plans to scale to 1 gigawatt by the end of next year, both for Google and for Microsoft, which in the latter case means 500K B200s across multiple sites and should require about $30-50 billion. Possibly we'll start seeing the first 7e27 FLOPs models (about 300x original GPT-4) in the second half of 2026 (or maybe they'll be seeing us). So for now, OpenAI has no ability to escape Microsoft's patronage, because it can't secure enough funding in time to start catching up with the next level of scale. And Microsoft is motivated to keep sponsoring OpenAI according to what the current level of scaling demands, as long as it remains willing to build the next frontier training system.
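The 7e27 estimate follows from the same arithmetic at the next hardware generation. The B200 throughput, MFU, and training duration below are my assumptions, not figures from the text, as is the ~2e25 FLOPs estimate commonly cited for original GPT-4:

```python
# Rough check of the 7e27 figure for a 500K-B200 cluster.
# Assumed (not from the text): ~2.2e15 dense BF16 FLOP/s per B200,
# ~40% MFU, ~6 months of training.

B200_FLOPS = 2.2e15          # assumed dense BF16 FLOP/s per B200
total = 500_000 * B200_FLOPS * 0.4 * (180 * 86_400)
print(f"total: {total:.1e}")  # ~7e27

GPT4_FLOPS = 2e25            # common outside estimate for original GPT-4
print(f"{total / GPT4_FLOPS:.0f}x GPT-4")  # ~300-350x
```

Under these assumptions the cluster delivers on the order of 7e27 FLOPs in about half a year, matching the ~300x-GPT-4 multiplier in the text.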
So far, the yearly capital expenditures of Microsoft Azure, Google Cloud Platform, and Amazon Web Services are about $50 billion each, which covers all of their development across the whole world, so 2025 is going to start stressing their budgets. Also, I'm not aware of what's going on with Amazon for 2025 and 1 gigawatt clusters (or even 2024 and 100K H100s clusters), and Musk mentioned plans for 300K B200s by summer 2025.