I think ‘GB200’ refers to this column (2 Blackwell GPU + 1 Grace CPU) so 16K GB200s ~= 32K B200s ~= 80K H100s. Agree that it is still very low.
My guess is that Bloomberg’s phrasing is just misleading or the reporting is incomplete. For example, maybe they are only reporting the chips Oracle is contributing or something like that. I’d be very surprised if OpenAI don’t have access to >200K GB200s ~= 1M H100s by the end of 2025. For reference, that is only ~$20B capex (assuming $100k total cost of ownership per GB200) or roughly 1⁄4 of what Microsoft alone plan to invest this year.
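The capex figure above is simple arithmetic; here is the back-of-envelope check, treating the $100k total cost of ownership per GB200 superchip as the stated assumption:

```python
# Back-of-envelope check of the capex estimate above.
# Assumption from the comment: ~$100k total cost of ownership per GB200 superchip.
tco_per_gb200 = 100_000          # dollars per GB200 superchip (assumed)
n_gb200 = 200_000                # superchips available by end of 2025

capex = tco_per_gb200 * n_gb200
print(f"${capex / 1e9:.0f}B")    # → $20B
```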
Once they have just 100K GB200s, that should train 2e27 FLOP in 4 months.[1]
There’s a nice rule of thumb here: at 40% utilisation and 16-bit precision, one H100 delivers about 1e21 FLOP/month. So since 100K GB200s = 500K H100s, that’s 5e26 FLOP/month.
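Spelling out the FLOP-budget arithmetic above (1e21 FLOP/month per H100-equivalent, 2 B200s per GB200 superchip, and 2.5x H100 per B200, all figures as stated in this thread):

```python
# FLOP budget sketch using the rule of thumb above:
# ~1e21 FLOP/month per H100-equivalent at 40% utilisation, 16-bit precision.
FLOP_PER_MONTH_PER_H100 = 1e21

n_gb200_superchips = 100_000
b200_per_superchip = 2            # GB200 superchip = 2 B200 GPUs + 1 Grace CPU
h100_per_b200 = 2.5               # a B200 is ~2.5x an H100 in dense BF16 compute

h100_equiv = n_gb200_superchips * b200_per_superchip * h100_per_b200
flop_per_month = h100_equiv * FLOP_PER_MONTH_PER_H100

print(h100_equiv)                 # 500000.0 H100-equivalents
print(flop_per_month * 4)         # 2e+27 FLOP over 4 months
```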
The marketing terminology is inconvenient: a “superchip” can mean a 2-GPU or 4-GPU board, or even a 72-GPU system (1 or possibly 2 racks). So it’s better to talk in terms of chips (which are not “superchips”); I think these are all B200s run at slightly different clock speeds (not to be confused with B200A/B102/B20, which have half the compute). In GB200, the chips are 2.5x faster than H100/H200 (not 5x faster), so a 200K-chip GB200 system has the same compute as a 500K-chip H100 system, not a 1M-chip H100 system. Power requirements are often a good clue for disambiguation; compute doesn’t consistently help, because it tends to get reported at arbitrarily chosen precision and sparsity[1].
Large scale-up worlds (or good chips) are not necessarily very important in pretraining, especially in the later steps of the optimizer once the critical batch size gets high enough, so it’s not completely obvious that a training system will prefer to wait for NVL72 even if other packagings of Blackwell become available earlier. Inference benefits a lot from NVL72, but for pretraining Blackwell is mainly just cheaper per FLOP than H100, and faster in wall-clock time during the first ~3T tokens, when the whole cluster can’t yet be used if the scale-up worlds are too small (see Section 3.4.1 of the Llama 3 report).
From the initial post by Crusoe (working on the Abilene campus), there is a vague mention of 200 MW and a much clearer claim that each data center building will host 100K GPUs. For GB200, all-in power per chip is about 2 kW, so the 200 MW fits as a description of a data center building. The video that went out at the time of the Jan 2025 Stargate announcement, and also a SemiAnalysis aerial photo, show two 4-section buildings. Dylan Patel claimed on the Dwarkesh Podcast that the largest single-site campus associated with OpenAI/Microsoft being built in 2025 can hold 300K GB200 chips. From this I guess that each 4-section building can hold 100K chips of GB200 requiring 200 MW, and that they have two of these mostly built. And 200K chips of GB200 are sufficient to train a 2e27 FLOP model (the next scale after Grok 3′s ~3e26 FLOP), so that makes sense as a step towards pretraining independence from Microsoft. But 16K chips, or possibly 16K NVL4 superchips, won’t make a difference; 100K H100s are on the same level (which GPT-4.5 suggests they already have available to them), and for inference Azure will have more Blackwells this year anyway.
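The power arithmetic behind this estimate, spelled out (~2 kW all-in per B200 chip and 100K chips per building, as stated above):

```python
# Power sanity check: do 100K Blackwell chips per building match the 200 MW figure?
watts_per_chip = 2_000            # ~2 kW all-in per chip in GB200 racks
chips_per_building = 100_000

building_mw = watts_per_chip * chips_per_building / 1e6
print(building_mw)                # 200.0 MW per building
# Two buildings → 200K chips, the scale discussed above for a 2e27 FLOP run.
```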
For pretraining, you need dense compute rather than sparse. It’s unclear whether FP8 (rather than BF16) is widely used in pretraining of frontier models that are the first experiment at a new scale, or mostly in smaller or optimized models. But the GPT-4.5 announcement video vaguely mentions work on low precision in pretraining, and high-granularity MoE of the kind DeepSeek-V3 uses makes FP8 more plausible for the FFN weights.
That’s indeed inconvenient. I was aware of NVL2, NVL4, NVL36, NVL72, but I was under the impression that ‘GB200’ mentioned on its own always means 2 Blackwells, 1 Grace (unless you add on a ‘NVL__’). Are there counterexamples to this? I scanned the links you mentioned and only saw ‘GB200 NVL2,’ ‘GB200 NVL4,’ ‘GB200 NVL72’ respectively.
I was operating on this pretty confidently but unsure where else I saw this described (apart from the column I linked above). On a quick search of ‘GB200 vs B200’ the first link I found seemed to corroborate GB200 = 2xB200s + 1xGrace CPU. Edit: second link also says: “the Grace-Blackwell GB200 Superchip. This is a module that has two B200 GPUs wired to an NVIDIA Grace CPU...”
“GB200 superchip” seems to be unambiguously Grace+2xB200. The issue is “100K GB200 GPUs” or “100K GB200 cluster”, and to some extent “100K GPU GB200 NVL72 cluster”. Also, people will abbreviate various clearer forms to just “GB200”. I think “100K chip GB200 NVL72 training system” less ambiguously refers to the number of B200s, but someone unfamiliar with this terminological nightmare might abbreviate it to “100K GB200 system”.
Good point, thanks. Previously I would have pretty confidently read “100K GB200 GPUs,” or “100K GB200 cluster” as 200K B200s (~= 500K H100s) but I can see how it’s easily ambiguous. Now that I think of it, I remembered this Tom’s Hardware article where B200 and GB200 are mistakenly used interchangeably (compare the subtitle vs. the end of the first paragraph)...