I meant “reliable agents” in the AI 2027 sense, that is, something on the order of being sufficient for automated AI research, leading to much more revenue and investment in the lead-up rather than stalling at ~$100bn per individual training system for multiple years. My point is that it’s not currently knowable whether this happens imminently in 2026-2027 or at least a few years later, meaning I don’t expect that evidence exists that distinguishes these possibilities even within the leading AI companies.
The reason Rubin NVL576 probably won’t help as much as the current transition from Hopper is that Blackwell NVL72 is already ~sufficient for the model sizes that are compute optimal to train on $30bn Blackwell training systems (which Rubin NVL144 training systems probably won’t significantly leapfrog before Rubin NVL576 comes out, unless there are reliable agents in 2026-2027 and funding goes through the roof).
when we get 576 (194 gpus)
The terminology Huang was advocating for at GTC 2025 (at 1:28:04) is to use “GPU” to refer to compute dies rather than chips/packages, and in these terms a Rubin NVL576 rack has 144 chips and 576 GPUs, rather than 144 GPUs. Even though this seems contentious, the terms compute die and chip/package remain less ambiguous than “GPU”.
The solution is an increase in scale-up world size, but the “bug” I was talking about is in how it used to be too small for the sizes of LLMs that are compute optimal at the current level of training compute. With Blackwell NVL72, this is no longer the case, and shouldn’t again become the case going forward. Even though there was a theoretical Hopper NVL256, for whatever reason in practice everyone ended up with only Hopper NVL8.
The size of the effect of insufficient world size[1] depends on the size of the model, and gets more severe for reasoning models on long context, where with this year’s models each request asks the system to generate (decode) on the order of 50K tokens while needing to maintain access to on the order of 100K tokens of KV-cache per trace. This might be the reason Hopper NVL256 never shipped, as this use case wasn’t really present in 2022-2024, but in 2025 it’s critically important, and so the incoming Blackwell NVL72/NVL36 systems will have a large impact.
(There are two main things a large world size helps with: it makes more HBM for KV-cache available, and it enables more aggressive tensor parallelism. When generating a token, the data for all previous tokens (KV-cache) needs to be available to process the attention blocks, and tokens for a given trace need to be generated sequentially, one at a time (or something like 1-4 at a time with speculative decoding). Generating one token only needs a little bit of compute, so it would be best to generate tokens for many traces at once, one for each, using more compute across these many tokens. But for this to work, all the KV-caches for all these traces need to sit in HBM. If the system would run out of memory, it needs to constrain the number of traces it’ll process within a single batch, which means the cost per trace (and per generated token) goes up, since the cost to use the system’s time is the same regardless of what it’s doing.
Tensor parallelism lets matrix multiplications go faster by using multiple chips for the same matrix multiplication. Since tokens need to be generated sequentially, one of the only ways to generate a long reasoning trace faster (with given hardware) is by using tensor parallelism (expert parallelism should also help when using high granularity MoE, where a significant number of experts within a layer are active at once, rather than the usual 2). And practical tensor parallelism is constrained by the world size.)
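To make the memory side of this concrete, here is a rough sketch with illustrative numbers only (a hypothetical dense model with a Llama-3-405B-like shape, FP8 assumed for both weights and KV-cache, and nothing counted for activations or other overhead; real deployments differ in many ways):

```python
# Back-of-the-envelope: how scale-up world size bounds the number of concurrent
# long reasoning traces. Illustrative numbers only, not a real deployment.

GB = 1e9

# Hypothetical dense model with a Llama-3-405B-like shape.
layers, kv_heads, head_dim = 126, 8, 128
params = 405e9
bytes_per_value = 1  # assuming FP8 for both weights and KV-cache

kv_bytes_per_token = 2 * kv_heads * head_dim * bytes_per_value * layers  # K and V
weights_bytes = params * bytes_per_value

tokens_per_trace = 100_000  # KV-cache a long reasoning trace needs to keep around
kv_per_trace = kv_bytes_per_token * tokens_per_trace
print(f"KV-cache per trace: {kv_per_trace / GB:.0f} GB")  # ~26 GB

# Compare scale-up worlds: Hopper NVL8 (8 x 80 GB HBM) vs Blackwell NVL72 (72 x ~192 GB).
for name, chips, hbm_per_chip_gb in [("Hopper NVL8", 8, 80), ("Blackwell NVL72", 72, 192)]:
    hbm = chips * hbm_per_chip_gb * GB
    free = hbm - weights_bytes  # HBM left for KV-cache after one copy of the weights
    traces = int(free // kv_per_trace) if free > 0 else 0
    print(f"{name}: ~{traces} concurrent 100K-token traces")

# Hopper NVL8 fits only a handful of such traces, Blackwell NVL72 fits hundreds, so
# far more tokens get generated per unit of (equally expensive) system time.
```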
[1] As in this image (backup in-blog link) that in its most recent incarnation appeared in the GTC 2025 keynote (at 1:15:56).
The loss goes down; whether that helps in some more legible way that also happens to be impactful is much harder to figure out. The experiments in the May 2023 paper show that training on some dataset and training on a random quarter of that dataset repeated 4 times result in approximately the same loss (Figure 4). Even 15 repetitions remain useful, though at that point somewhat less useful than 15 times more unique data. There is also some sort of double descent where loss starts getting better again after hundreds of repetitions (Figure 9 in Appendix D).
This strongly suggests that repeating merely 3 times will robustly be about as useful as having 3 times more data from the same distribution. I don’t know of comparably strong clues that would change this expectation.
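To put numbers on this, a small sketch assuming the saturating-exponential fit for effective data from that paper, with a decay constant of roughly 15 (treat the exact values as illustrative):

```python
import math

R_STAR = 15  # approximate decay constant fitted in the May 2023 paper

def effective_tokens(unique_tokens, epochs):
    """Effective unique-data equivalent of `unique_tokens` seen `epochs` times,
    using D' = U + U * R* * (1 - exp(-R / R*)) with R = epochs - 1 repetitions."""
    repeats = epochs - 1
    return unique_tokens * (1 + R_STAR * (1 - math.exp(-repeats / R_STAR)))

D = 30e12  # some dataset, say 30T tokens

# A random quarter of the data repeated 4 times vs the full dataset once (Figure 4):
print(effective_tokens(D / 4, 4) / effective_tokens(D, 1))  # ~0.93, about the same loss

# 15 repetitions vs 15x more unique data:
print(effective_tokens(D, 15) / (15 * D))  # ~0.67, still useful but noticeably less

# 3 repetitions, the case discussed above:
print(effective_tokens(D, 3) / (3 * D))  # ~0.96, about as good as 3x more unique data
```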
I think Blackwell will change the sentiment by late 2025 compared to 2024, with a lot of apparent progress in capabilities and reduced prices (which the public will have a hard time correctly attributing to Blackwell). In 2026 there will be some Blackwell-trained models, using 2x-4x more compute than what we see today (or what we’ll see more of in a few weeks to months with the added long reasoning option, such as GPT-4.5 with reasoning).
But then the possibilities for 2027 branch on whether there are reliable agents, which doesn’t seem knowable either way right now. If this doesn’t work out, in particular because R1-like RL training doesn’t scale or generalize, then by 2027 nothing substantially new will happen, and the 2024-style slowdown sentiment will return, since a 3x-5x increase in training compute is not a game-changing amount (unless there is a nearby threshold to be reached), and Blackwell is a one-time thing that essentially fixes a bug in the Ampere/Hopper design (in efficiency for LLM inference) and can’t be repeated even with Rubin Ultra NVL576. At that point individual training systems will cost on the order of $100bn, and so won’t have much further to scale other than at the slower pace of chip improvement (within the assumption of absence of reliable agents). The Chinese AI companies will be more than 10x but less than 100x behind in training compute (mostly because AI fails to become a priority), which can occasionally but not reliably be surmounted with brilliant engineering innovations.
A power seeker is ambitious without an ambition, which is not an implication of being agentic.
The announcement post says the following on the scale of Behemoth:
we focus on efficient model training by using FP8 precision, without sacrificing quality and ensuring high model FLOPs utilization—while pre-training our Llama 4 Behemoth model using FP8 and 32K GPUs, we achieved 390 TFLOPs/GPU. The overall data mixture for training consisted of more than 30 trillion tokens
This puts Llama 4 Behemoth at 5e25 FLOPs (30% more than Llama-3-405B), trained on 32K H100s (only 2x more than Llama-3-405B) instead of the 128K H100s (or in any case, 100K+) they should have. They are training in FP8 (which gets 2x more FLOP/s per chip than the easier-to-work-with BF16), but with 20% compute utilization (2x lower than in dense Llama-3-405B; training MoE is harder).
At 1:8 sparsity (2T total parameters, ~250B in active experts), it should have 3x lower data efficiency than a dense model (and 3x as much effective compute, so it has 4x the effective compute of Llama-3-405B even at merely 1.3x raw compute). Anchoring to Llama-3-405B, which is dense and compute optimal at 38 tokens per parameter with their dataset, we get about 120 tokens per active parameter as optimal for a model with Behemoth’s shape, which for 288B active parameters gives 35T tokens. This fits their 30T tokens very well, so it’s indeed a compute optimal model (and not a middle-sized overtrained model that inherited the title of “Behemoth” from a failed 128K H100s run).
In any case, for some reason they didn’t do the large training run their hardware in principle enables, and even then their training run was only about 2 months (1.5 months from total compute and utilization, plus a bit longer at the start to increase the critical batch size enough to start training on the whole training system). (Running out of data shouldn’t be a reason to give up on 128K H100s, as a compute optimal 1:8 sparsity model would’ve needed only 90T tokens at 750B active parameters, if trained in FP8 with 20% compute utilization for 3 months. Which could just be the same 30T tokens repeated 3 times.)
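For reference, the arithmetic behind these estimates, using the standard 6*N*D approximation for training FLOPs and an H100 dense FP8 peak of ~2e15 FLOP/s (the tokens-per-parameter multiplier is the rough 3x adjustment from the previous paragraph):

```python
# Back-of-the-envelope for Llama 4 Behemoth and the hypothetical 128K H100s run.

active_params = 288e9
tokens = 30e12
train_flops = 6 * active_params * tokens             # standard 6*N*D approximation
print(f"Behemoth training compute: {train_flops:.1e} FLOPs")   # ~5.2e25

# Compute utilization implied by the announced 390 TFLOP/s per H100 in FP8
# (H100 dense FP8 peak is ~2e15 FLOP/s).
h100_fp8_peak = 1.98e15
print(f"Utilization: {390e12 / h100_fp8_peak:.0%}")            # ~20%

# Run duration on 32K H100s at the achieved 390 TFLOP/s per chip.
seconds = train_flops / (32_768 * 390e12)
print(f"Duration: {seconds / 86400:.0f} days")                 # ~47 days, i.e. ~1.5 months

# Compute optimal data, anchoring to dense Llama-3-405B (38 tokens per parameter)
# with the ~3x adjustment for 1:8 sparsity from the paragraph above.
tokens_per_active_param = 120                                  # ~38 * 3
print(f"Optimal data: {tokens_per_active_param * active_params:.1e} tokens")   # ~35T

# A compute optimal 1:8 sparsity model on 128K H100s for 3 months at the same
# 390 TFLOP/s per chip.
flops_128k = 128_000 * 390e12 * 3 * 30 * 86400                 # ~3.9e26 FLOPs
n_opt = (flops_128k / (6 * tokens_per_active_param)) ** 0.5
print(f"Optimal size: {n_opt:.1e} active params, {tokens_per_active_param * n_opt:.1e} tokens")
# ~7.3e11 (~750B) active params trained on ~9e13 (~90T) tokens
```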
For me a specific crux is the scaling laws of R1-like training: what happens when you try to do much more of it, which inputs to this process become important constraints, and how much they matter. This working out was extensively brandished but not yet described quantitatively; all the reproductions of long reasoning training only had one iteration on top of some pretrained model, and even o3 isn’t currently known to be based on the same pretrained model as o1.
The AI 2027 story heavily leans into RL training taking off promptly, and it’s possible the authors are resonating with some insider rumors grounded in reality, but from my point of view it’s too early to tell. I guess in a few months to a year there should be enough public data to tell something, but then again a quantitative model of scaling for MoE (compared to dense) was only published in Jan 2025, even though MoE was already key to the original GPT-4 trained in 2022.
Non-Google models of late 2027 use Nvidia Rubin, but not yet Rubin Ultra. Rubin NVL144 racks have the same number of compute dies and chips as Blackwell NVL72 racks (the change in the name is purely a marketing thing; they now count dies instead of chips). The compute dies are already almost reticle sized and can’t get bigger, but Rubin uses 3nm (~180M Tr/mm2) while Blackwell is 4nm (~130M Tr/mm2). So the number of transistors per rack goes up according to transistor density between 4nm and 3nm, by 1.4x, plus better energy efficiency enables higher clock speed, maybe another 1.4x, for a total of 2x in performance. The GTC 2025 announcement claimed a 3.3x improvement for dense FP8, but based on the above argument it should still be only about 2x for the more transistor-hungry BF16 (comparing Blackwell and Rubin racks).
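Spelling out that estimate (the clock speed factor is a guess, as noted):

```python
density_ratio = 180 / 130  # 4nm -> 3nm transistor density, ~1.4x
clock_ratio = 1.4          # guess for what better energy efficiency buys in clock speed
print(density_ratio * clock_ratio)  # ~1.9, i.e. about 2x BF16 per rack,
                                    # vs the 3.3x claimed for dense FP8
```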
The Abilene site of Stargate[1] will probably have 400K-500K Blackwell chips in 2026, about 1 GW. Nvidia’s roadmap puts Rubin (VR200 NVL144) 1.5-2 years after Blackwell (GB200 NVL72), which is not yet in widespread use, but will get there soon. So the first models will start being trained on Rubin no earlier than late 2026, much more likely only in 2027, possibly even the second half of 2027. Before that, it’s all Blackwell, and if it’s only 1 GW Blackwell training systems[2] in 2026 for one AI company, shortly before 2x better Rubin comes out, then that’s the scale where Blackwell stops, awaiting Rubin and 2027. The Rubin systems will only be built at scale a bit later still, similarly to how in 2025 it’s only 100K chips in GB200 NVL72 racks for what might be intended to be a single training system, and not yet 500K chips.
This predicts at most 1e28 BF16 FLOPs (2e28 FP8 FLOPs) models in late 2026 (trained on 2 GW of GB200/GB300 NVL72), and very unlikely more than 1e28-4e28 BF16 FLOPs models in late 2027 (1-4 GW Rubin datacenters in late 2026 to early 2027), though that’s alternatively 3e28-1e29 FP8 FLOPs given the FP8/BF16 performance ratio change with Rubin I’m expecting. Rubin Ultra is another big step ~1 year after Rubin, with 2x more compute dies per chip and 2x more chips per rack, so it’s a reason to plan pacing the scaling a bit rather than rushing it in 2026-2027. Such plans will make rushing it more difficult if there is suddenly a reason to do so, and 4 GW with non-Ultra Rubin seems a bit sudden.
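A sketch of where the 1e28 BF16 FLOPs figure comes from, under stated assumptions (~2 kW per chip all-in, consistent with 400K-500K chips per GW above; ~2.5e15 dense BF16 FLOP/s per Blackwell chip; 40% utilization; a ~4 month run):

```python
# Rough estimate of the largest late-2026 training run on 2 GW of GB200/GB300 NVL72.

power_w = 2e9                     # 2 GW
watts_per_chip = 2_200            # all-in (chip, networking, cooling), ~450K chips per GW
chips = power_w / watts_per_chip  # ~0.9M Blackwell chips

bf16_per_chip = 2.5e15            # dense BF16 FLOP/s per Blackwell chip (approximate)
utilization = 0.4
seconds = 4 * 30 * 86400          # ~4 month training run

flops = chips * bf16_per_chip * utilization * seconds
print(f"{flops:.1e} BF16 FLOPs")  # ~1e28, and roughly 2x that in FP8 terms
```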
So pretty similar to Agent 2 and Agent 4 at some points, keeping to the highest estimates, but with less compute than the plot suggests for months while the next generation of datacenters is being constructed (during the late 2026 to early 2027 Blackwell-Rubin gap).
Technical Claims
Beliefs held by others are a real phenomenon, so tracking them doesn’t give them unearned weight in attention, as long as they are not confused with your own beliefs. You can even learn things specifically for the purpose of changing their simulated mind rather than your own (in whatever direction the winds of evidence happen to blow).
The scale of training and R&D spending by AI companies can be reduced on short notice, while global inference buildout costs much more and needs years of use to pay for itself. So an AI slowdown mostly hurts clouds and makes compute cheap due to oversupply, which might be a wash for AI companies. Confusingly, major AI companies are closely tied to cloud providers, but OpenAI is distancing itself from Microsoft, and Meta and xAI are not cloud providers, so they wouldn’t suffer as much. In any case the tech giants will survive; it’s losing their favor that seems more likely to damage AI companies, making them no longer able to invest as much in R&D.
https://slatestarcodex.com/2014/07/30/meditations-on-moloch/
It’s “mainstream” here, described well many times before.
if we didn’t have a capitalist system, then the entire point about profit motives, pride, and race dynamics wouldn’t apply
Presence of many nations without a central authority still contributes to race dynamics.
If o3 is based on GPT-4o, there is a reasoning model based on GPT-4.5 that’s better. If o3 is based on GPT-4.5 (and so the reason it’s still not out is that they are waiting for Blackwells to inference it at a reasonable speed and cost), then it was a hasty job just after the base model for GPT-4.5 was pretrained, and so by now they have something much better. Either way, there is some sort of “o4”, but it’s probably a bad look to announce it before releasing the already-announced o3.
Before life, there are only rocks and astronomical objects. Once new things can be created, the prior world is relatively unimportant to understand in comparison, because it’s constrained to the happenstance of what was there in the past, and there is no similar constraint on what can be created in the future.
Most interesting things are those that get intentionally created with the purpose of being interesting in mind. For any purpose, this or another, that doesn’t end up referencing humanity or the past, it’s possible to create more optimal things in view of that purpose than anything that already happens to exist, because things that happen to exist were never superintelligently optimized to fit that purpose. Humanity is like rocks and astronomical objects, relics that are not optimal in most respects.
Hence “next token predictor” is a bit of a misnomer, as computation on any given token will also try to contribute to prediction of distant future tokens, not just the next one.
Curious if it’s built on the same base model as Gemini 2.0 Pro, or on a completely new pretrained model. With 100K TPUv6e datacenters (about the same as 100K H100 in training compute), Gemini 2.0 Pro seems like it’s underperforming in its weight class, compared to GPT-4.5 and even Grok 3 (likely trained on similar compute). So it makes some sense they’d write it off as a failed run and do another one, but alternatively long reasoning post-training could’ve fixed enough to get a good reasoning model. In which case the name choice breaks the pattern of Gemini 2.0 Flash Thinking, but could be a way of distancing the success of Gemini 2.5 Pro (the reasoning model) from the mediocre performance of Gemini 2.0 Pro (the chat model).
Google’s TPUv6e systems have large scale-up world sizes, unlike Hopper systems. So there is a short window of a few months where they have the advantage in being able to more cheaply inference large reasoning models, unlike everyone else (unless they also use TPUs). Other AI companies would need access to Blackwell NVL36 or NVL72 in order to get reasonable cost and speed of inferencing large reasoning models, and it seems it’ll take another 2-5 months before they are out in force.
Your point is one of the clues I mentioned that I don’t see as comparably strong to the May 2023 paper, when it comes to prediction of loss/perplexity. The framing in your argument appeals to things other than the low-level metric of loss, so I opened my reply by focusing on it rather than the more nebulous things that are actually important in practice. Scaling laws work with loss the best (holding across many OOMs of compute), and repeating 3x rather than 7x (where loss first starts noticeably degrading) gives some margin of error. That is, a theoretical argument along the lines of what you are saying shifts my expectation for 10x-20x repetition (which might degrade faster when working with lower quality data), but not yet for 3x repetition (which I still expect to get an ~unchanged loss).
So far I haven’t even seen anyone there notice that Behemoth means that Llama 4 was essentially canceled and instead we got some sort of Llama 3.5 MoE. That is, a 100K+ H100s training run that was the expected and announced crown jewel of Llama 4 won’t be coming out, probably until at least late 2025 and possibly even 2026. Since Behemoth is the flagship model for Llama 4, a 3e26+ FLOPs model that would’ve been appropriate for a 100K H100s training system instead got pushed back to Llama 5.
As Behemoth is only a 5e25 FLOPs model, even once it comes out it won’t be competing in the same weight class as GPT-4.5, Grok 3, and Gemini 2.5 Pro. Maverick is only a 2e24 FLOPs[1] model (2x less than DeepSeek-V3, ~100x less than the recent frontier models), so of course it’s not very good compared to the frontier models. Since Meta hasn’t so far demonstrated competence on the level of DeepSeek or Anthropic, they do need the big compute to remain in the game, and Maverick is certainly not big compute.
(LocalLLaMA specifically is annoyed by the absence of models with a small number of total parameters in the current Llama 4 announcement, which means you need high end consumer hardware to run even Scout in 4 bit quantization locally, and datacenter hardware for the rest.)
[1] It’s a 17B active parameter model trained for 22T tokens.
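For reference, the same 6*N*D arithmetic for Maverick, and the rough scale of the 100K+ H100s run mentioned above (assuming ~1e15 dense BF16 FLOP/s per H100, 40% utilization, and a ~4 month run):

```python
# Maverick training compute from the numbers in the footnote above.
print(f"{6 * 17e9 * 22e12:.1e} FLOPs")  # ~2.2e24

# Rough scale of the 100K H100s run that Behemoth isn't (assumed: ~1e15 dense BF16
# FLOP/s per H100, 40% utilization, ~4 months).
print(f"{100_000 * 1e15 * 0.4 * 4 * 30 * 86400:.1e} FLOPs")  # ~4e26, i.e. the 3e26+ class
```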