The solution is increase in scale-up world size, but the “bug” I was talking about is in how it used to be too small for the sizes of LLMs that are compute optimal at the current level of training compute. With Blackwell NVL72, this is no longer the case, and shouldn’t again become the case going forward. Even though there was a theoretical Hopper NVL256, for whatever reason in practice everyone ended up with only Hopper NVL8.
The size of the effect of insufficient world size[1] depends on the size of the model, and gets more severe for reasoning models on long context, where with this year’s models each request would want to ask the system to generate (decode) on the order of 50K tokens while needing to maintain access to on the order of 100K tokens of KV-cache per trace. This might be the reason Hopper NVL256 never shipped, as this use case wasn’t really present in 2022-2024, but in 2025 it’s critically important, and so the incoming Blackwell NVL72/NVL36 systems will have a large impact.
(There are two main things a large world size helps with: it makes more HBM for KV-cache available, and it enables more aggressive tensor parallelism. When generating a token, the data for all previous tokens (KV-cache) needs to be available to process the attention blocks, and tokens for a given trace need to be generated sequentially, one at a time (or something like 1-4 at a time with speculative decoding). Generating one token only needs a little bit of compute, so it would be best to generate tokens for many traces at once, one for each, using more compute across these many tokens. But for this to work, all the KV-caches for all these traces need to sit in HBM. If the system would run out of memory, it needs to constrain the number of traces it’ll process within a single batch, which means the cost per trace (and per generated token) goes up, since the cost to use the system’s time is the same regardless of what it’s doing.
Tensor parallelism lets matrix multiplications go faster by using multiple chips for the same matrix multiplication. Since tokens need to be generated sequentially, one of the only ways to generate a long reasoning trace faster (with given hardware) is by using tensor parallelism (expert parallelism should also help when using high granularity MoE, where a significant number of experts within a layer is active at once, rather than the usual 2). And practical tensor parallelism is constrained to the world size.)
- ↩︎
As in this image (backup in-blog link) that in its most recent incarnation appeared in the GTC 2025 keynote (at 1:15:56).
The reason Rubin NVL576 probably won’t help as much as the current transition from Hopper is that Blackwell NVL72 is already ~sufficient for the model sizes that are compute optimal to train on $30bn Blackwell training systems (which Rubin NVL144 training systems probably won’t significantly leapfrog before Rubin NVL576 comes out, unless there are reliable agents in 2026-2027 and funding goes through the roof).
The terminology Huang was advocating for at GTC 2025 (at 1:28:04) is to use “GPU” to refer to compute dies rather than chips/packages, and in these terms a Rubin NVL576 rack has 144 chips and 576 GPUs, rather than 144 GPUs. Even though this seems contentious, the terms compute die and chip/package remain less ambiguous than “GPU”.