That’s indeed inconvenient. I was aware of NVL2, NVL4, NVL36, and NVL72, but I was under the impression that ‘GB200’ on its own always means 2 Blackwells and 1 Grace (unless you append an ‘NVL__’). Are there counterexamples to this? I scanned the links you mentioned and only saw ‘GB200 NVL2,’ ‘GB200 NVL4,’ and ‘GB200 NVL72’ respectively.
I was operating on this pretty confidently, but I’m unsure where else I saw it described (apart from the column I linked above). On a quick search for ‘GB200 vs B200’, the first link I found seemed to corroborate GB200 = 2xB200s + 1xGrace CPU. Edit: the second link also says: “the Grace-Blackwell GB200 Superchip. This is a module that has two B200 GPUs wired to an NVIDIA Grace CPU...”
I think ‘GB200’ refers to this column (2 Blackwell GPUs + 1 Grace CPU), so 16K GB200s ~= 32K B200s ~= 80K H100s. Agreed that it is still very low.
My guess is that Bloomberg’s phrasing is just misleading or the reporting is incomplete. For example, maybe they are only reporting the chips Oracle is contributing, or something like that. I’d be very surprised if OpenAI don’t have access to >200K GB200s ~= 1M H100s by the end of 2025. For reference, that is only ~$20B of capex (assuming $100k total cost of ownership per GB200), or roughly 1/4 of what Microsoft alone plan to invest this year.
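As a quick sanity check on these numbers, here’s the arithmetic spelled out (the GB200 = 2× B200 and B200 ≈ 2.5× H100 equivalences and the $100k TCO figure are the assumptions used in this thread, not official specs):

```python
# Rough chip-count and capex arithmetic under this thread's assumptions:
#   1 GB200 superchip = 2 B200 GPUs + 1 Grace CPU
#   1 B200 ~= 2.5 H100s in training-throughput terms
#   ~$100k total cost of ownership per GB200 (assumed)

B200_PER_GB200 = 2
H100_PER_B200 = 2.5
TCO_PER_GB200 = 100_000  # USD, assumed

def gb200_to_h100_equiv(n_gb200: float) -> float:
    """Approximate H100-equivalents for a given number of GB200 superchips."""
    return n_gb200 * B200_PER_GB200 * H100_PER_B200

print(gb200_to_h100_equiv(16_000))    # 80,000   -> the '16K GB200s ~= 80K H100s' line
print(gb200_to_h100_equiv(200_000))   # 1,000,000 -> '>200K GB200s ~= 1M H100s'
print(200_000 * TCO_PER_GB200 / 1e9)  # 20.0     -> ~$20B capex
```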
Once they have just 100K GB200s, that should train 2e27 FLOP in 4 months.[1]

[1] There’s a nice correspondence between H100s and FLOP/month (assuming 40% utilisation and 16-bit precision) of 1e21 FLOP/month/H100. So since 100K GB200s = 500K H100s, that’s 5e26 FLOP/month.
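For anyone who wants to reproduce the footnote’s numbers, here’s a minimal sketch under the same assumptions (1e21 FLOP/month per H100 at 40% utilisation and 16-bit precision, and 100K GB200s ≈ 500K H100s):

```python
# Footnote [1] arithmetic, using this thread's assumed equivalences.
# Sanity check on the 1e21 figure: ~1e15 FLOP/s (H100 BF16 dense) * 0.4 utilisation
# * ~2.6e6 s/month ~= 1e21 FLOP/month.

FLOP_PER_H100_MONTH = 1e21            # assumed effective throughput per H100
h100_equivalents = 100_000 * 2 * 2.5  # 100K GB200s -> 200K B200s -> 500K H100s

flop_per_month = h100_equivalents * FLOP_PER_H100_MONTH
print(flop_per_month)        # 5e26 FLOP/month
print(flop_per_month * 4)    # 2e27 FLOP over a 4-month run
```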
Thanks Vladimir, this is really interesting!
Re: OpenAI’s compute, I inferred from this NYT article that their $8.7B in costs this year were likely to include about $6B in compute costs, which implies an average use of ~274k H100s throughout the year[1] (assuming a $2.50/hr average H100 rental price). Assuming this was their annual average, I would’ve guessed they’d be on track to be using around 400k H100s by now.
So the 150k-H100 campus in Phoenix might be only a small fraction of the total compute they have access to? Does this sound plausible?
The co-location of the Trainium2 cluster might give Anthropic a short-term advantage, though I think it’s actually quite unclear whether their networking and topology will fully enable this advantage. Perhaps the OpenAI Phoenix campus is well-connected enough to another OpenAI campus to be doing a 2-campus asynchronous training run effectively.
[1] $6e9 / 365.25d / 24h / $2.5/hr ≈ 274k
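Spelling out that footnote (the $6B compute spend and the $2.50/hr average H100 rental price are the assumptions stated above, not reported figures):

```python
# Average H100s rented, given ~$6B/yr of compute spend at an assumed $2.50/hr.
annual_compute_spend = 6e9        # USD, inferred from the NYT cost figures
hourly_rate = 2.50                # USD per H100-hour, assumed average
hours_per_year = 365.25 * 24      # 8766

avg_h100s = annual_compute_spend / hours_per_year / hourly_rate
print(round(avg_h100s))           # ~273,785, i.e. ~274k averaged over the year

# If usage ramped up over the year, the current count would sit above this
# average (e.g. under a roughly linear ramp, end-of-year ~= 2*average - start).
```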
Good point, thanks. Previously I would have pretty confidently read “100K GB200 GPUs” or “100K GB200 cluster” as 200K B200s (~= 500K H100s), but I can see how it’s easily ambiguous. Now that I think of it, I remember this Tom’s Hardware article where B200 and GB200 are mistakenly used interchangeably (compare the subtitle vs. the end of the first paragraph)...