Vladimir_Nesov

Karma: 32,563

Vladimir_Nesov Apr 12, 2025, 3:12 AM
4 points
0
in reply to: Remmelt’s comment on: Crash scenario 1: Rapidly mobilise for a 2025 AI crash

the impact of new Blackwell chips with improved computation

It’s about world size, not computation, and has a startling effect that probably won’t occur again with future chips, since Blackwell sufficiently catches up to models at the current scale.

But even then, OpenAI might get to ~$25bn annualized revenue that won’t be going away

What is this revenue estimate assuming?

The projection for 2025 is $12bn at 3x/year growth (1.1x per month, so $1.7bn per month at the end of 2025, $3bn per month in mid-2026), and my pessimistic timeline above assumes that this continues up to either end of 2025 or mid-2026 and then stops growing after the hypothetical “crash”, which gives $20-36bn per year.
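The monthly figures above can be reproduced with a short sketch. It assumes (my reading, not stated explicitly) that the $12bn is total revenue for calendar 2025 and that monthly revenue grows smoothly at 3x per year; the names are illustrative only.

```python
# Sketch: monthly revenue implied by $12bn in 2025 at 3x/year growth.
# Assumption: $12bn is the calendar-2025 total, growing smoothly month to month.

ANNUAL_GROWTH = 3.0
monthly_growth = ANNUAL_GROWTH ** (1 / 12)  # roughly 1.1x per month

TOTAL_2025_BN = 12.0
# Month k of 2025 (k = 0..11) earns first_month * monthly_growth**k, and the
# twelve months must sum to the 2025 total (a geometric series).
first_month = TOTAL_2025_BN * (monthly_growth - 1) / (monthly_growth ** 12 - 1)


def monthly_revenue_bn(months_after_jan_2025: int) -> float:
    """Monthly revenue in $bn, counting months from January 2025."""
    return first_month * monthly_growth ** months_after_jan_2025


end_2025 = monthly_revenue_bn(12)  # run rate at the turn of 2026
mid_2026 = monthly_revenue_bn(18)  # run rate in mid-2026

print(f"monthly growth: {monthly_growth:.3f}x")
print(f"end of 2025: ${end_2025:.2f}bn/month, ${12 * end_2025:.1f}bn/year")
print(f"mid-2026:    ${mid_2026:.2f}bn/month, ${12 * mid_2026:.1f}bn/year")
```

If growth stops at either point, the annualized run rate lands near the two ends of the $20-36bn range.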

Vladimir_Nesov Apr 11, 2025, 11:12 PM
2 points
0
in reply to: Tapatakt’s comment on: Weird Random Newcomb Problem
Not knowing n(-) results in not knowing expected utility of b (for any given b), because you won’t know how the terms a(n(a), n(a)) are formed.

(And also the whole being given numeric codes of programs as arguments thing gets weird when you are postulated to be unable to interpret what the codes mean. The point of Newcomblike problems is that you get to reason about behavior of specific agents.)

Vladimir_Nesov Apr 11, 2025, 9:37 PM
5 points
2
on: Comments on “AI 2027”

I can’t think of any reason to give a confident, high precision story that you don’t even believe in!

Datapoints generalize: a high-precision story holds gears that can be reused in other hypotheticals. I’m not sure what you mean by the story being presented as “confident” (in some sense it’s always wrong to say that a point prediction is “confident” rather than zero probability, even if it’s the mode of a distribution, the most probable point). But in any case I think giving high-precision stories is a good methodology for communicating a framing, pointing out which considerations seem to be more important in thinking about possibilities, and also which events (that happen to occur in the story) seem more plausible than their alternatives.

Vladimir_Nesov Apr 11, 2025, 8:48 PM
2 points
0
on: Weird Random Newcomb Problem

Question 1: Assume you are program b. You want to maximize the money you receive. What should you output if your input is (x,x) (i.e., the two numbers are equal)?

Question 2: Assume you are the programmer writing program b. You want to maximize the expected money program b receives. How should you design b to behave when it receives an input (x,x)?

Do you mean to ask how b should behave on input (n(b), n(b)), and how b should be written to behave on input (n(b), n(b)) for that b?

If x differs from n(b), it might matter in some subtle ways but not straightforwardly how b behaves on (x, x), because that never occurs explicitly in the actual thought experiment (where the first argument is always the code for the program itself). And if the programmer knows x before writing b, and x must be equal to n(b), then since n(-) is bijective, they don’t have any choice about how to write b other than to be the preimage of x under n(-).

Vladimir_Nesov Apr 11, 2025, 4:33 PM
9 points
2
in reply to: Thane Ruthenis’s comment on: On Google’s Safety Plan
Official policy documents from AI companies can be useful in bringing certain considerations into the domain of what is allowed to be taken seriously (in particular, by the governments), as opposed to remaining weird sci-fi ideas to be ignored by most Serious People. Even declarations by AI company leaders or Turing award winners or Nobel laureates or some of the most cited AI scientists won’t by themselves have that kind of legitimizing effect. So it’s not necessary for such documents to be able to directly affect actual policies of AI companies; they can still be important in affecting these policies indirectly.

Vladimir_Nesov Apr 11, 2025, 3:05 PM
8 points
4
on: Crash scenario 1: Rapidly mobilise for a 2025 AI crash
I think it’s overdetermined by Blackwell NVL72/NVL36 and long reasoning training that there will be no AI-specific “crash” until at least late 2026. Reasoning models want a lot of tokens, but their current use is constrained by cost and speed, and these issues will be going away to a significant extent. Already Google has Gemini 2.5 Pro (taking advantage of TPUs), and within a few months OpenAI and Anthropic will make reasoning variants of their largest models practical to use as well (those pretrained at the scale of 100K H100s / ~3e26 FLOPs, meaning GPT-4.5 for OpenAI).

The same practical limitations (as well as novelty of the technique) mean that long reasoning models aren’t using as many reasoning tokens as they could in principle. Everyone is still at the stage of getting long reasoning traces to work at all, rather than scaling things like the context length they can effectively use (in products rather than only internal research). It’s plausible that contexts with millions of reasoning tokens can be put to good use, where other training methods failed to make contexts at that scale work well.

So later in 2025 there’s better speed and cost, driving demand in terms of the number of prompts/requests, and for early to mid-2026 potentially longer reasoning traces, driving demand in terms of token count. After that, it depends on whether capabilities get much better than Gemini 2.5 Pro. Pretraining scale in deployed models will only advance 2x-5x by mid-2026 compared to now (using 100K-200K Blackwell chip training systems built in 2025), which is not a large enough change to be very noticeable, so it’s not by itself sufficient to prevent a return of late 2024 vaguely pessimistic sentiment, and other considerations might get more sway with funding outcomes. But even then, OpenAI might get to ~$25bn annualized revenue that won’t be going away, and in 2027 or slightly earlier there will be models pretrained for ~4e27 FLOPs using the training systems built in 2025-2026 (400K-600K Blackwell chips, 0.8-1.4 GW, $22-35bn), which as a 10x-15x change (compared to the models currently or soon-to-be deployed in 2025) is significant enough to get noticeably better across the board, even if nothing substantially game-changing gets unlocked. So the “crash” might be about revenue no longer growing 3x per year, and so the next generation training systems built in 2027-2028 not getting to the $150bn scale they otherwise might’ve aspired to.

Vladimir_Nesov Apr 10, 2025, 3:11 PM
5 points
2
in reply to: Benjamin_Todd’s comment on: The case for AGI by 2030
I think the idea of effective FLOPs has more narrow applicability than what you are running with, many things that count as compute multipliers don’t scale. They often only hold for particular capabilities that stop being worth boosting separately at greater levels of scale, or particular data that stops being available in sufficient quantity. An example of a scalable compute multiplier is MoE (even as it destroys data efficiency, and so damages some compute multipliers that rely on selection of high quality data). See Figure 4 in the Mamba paper for another example of a scalable compute multiplier (between GPT-3 transformer and Llama 2 transformer, Transformer and Transformer++ respectively in the figure). This issue is particularly serious when we extrapolate by many OOMs, and I think only very modest compute multipliers (like 1.5x/year) survive across all that, because most things that locally seem like compute multipliers don’t compound very far.

There are also issues I have with Epoch studies on this topic, mainly extrapolating scaling laws from things that are not scalable much at all and weren’t optimized for compute optimality, with trends being defined by limits of scalability and idiosyncratic choices of hyperparameters driven by other concerns, rather than a well-defined and consistent notion of compute optimality (which I’d argue wasn’t properly appreciated and optimized-for by the field until very recently, with Chinchilla still managing to fix glaring errors as late as 2022). Even now, papers arguing for compute multipliers keep showing in-training loss/perplexity plots that weren’t cooled down before measurement. I think Figure 11 from the recent OLMo 2 paper illustrates this point brilliantly, showing how what looks like a large compute multiplier^[1] before the learning rate schedule runs its course can lose all effect once it does.
1. ↩︎
  To be fair it’s kind of a toy case where the apparent “compute multiplier” couldn’t seriously be expected to be real, but it does illustrate the issue with the methodology of looking at loss/perplexity plots where the learning rate is still high, or might differ significantly between points being compared on the plots for different architecture variants, hopelessly confounding the calculation of a compute multiplier.

Vladimir_Nesov Apr 9, 2025, 10:53 PM
9 points
0
on: The case for AGI by 2030
spending tens of billions of dollars to build clusters that could train a GPT-6-sized model in 2028

Traditionally steps of GPT series are roughly 100x in raw compute (I’m not counting effective compute, since it’s not relevant to cost of training). GPT-4 is 2e25 FLOPs. Which puts “GPT-6” at 2e29 FLOPs. To train a model in 2028, you would build an Nvidia Rubin Ultra NVL576 (Kyber) training system in 2027. Each rack holds 576 compute dies at about 3e15 BF16 FLOP/s per die^[1] or 1.6e18 FLOP/s per rack. A Blackwell NVL72 datacenter costs about $4M per rack to build, possibly a non-Ultra Rubin NVL144 datacenter will cost about $5M per rack, and a Rubin Ultra NVL576 datacenter might cost about $12M per rack^[2].

To get 2e29 BF16 FLOPs in 4 months at 40% utilization, you’d need 30K racks that would cost about $360B all-in (together with the rest of the training system). Which is significantly more than “tens of billions of dollars”.
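As a rough check of the rack count and cost, here is a minimal sketch using only the estimates quoted above (1.6e18 FLOP/s per rack, 40% utilization, 4 months, $12M per rack). Taking a month as 30 days is my simplification.

```python
# Sketch: racks and all-in cost for a 2e29 FLOPs run on Rubin Ultra NVL576.
# All inputs are the estimates quoted in the comment, not measured values.

GPT4_FLOPS = 2e25
STEP = 100                                  # raw compute multiplier per GPT step
TARGET_FLOPS = GPT4_FLOPS * STEP * STEP     # "GPT-6", about 2e29

RACK_FLOPS_PER_S = 1.6e18                   # dense BF16 per NVL576 rack (estimate)
UTILIZATION = 0.40
TRAINING_SECONDS = 4 * 30 * 24 * 3600       # 4 months, with 30-day months
COST_PER_RACK_USD = 12e6                    # all-in datacenter cost per rack

flops_per_rack_per_run = RACK_FLOPS_PER_S * UTILIZATION * TRAINING_SECONDS
racks_needed = TARGET_FLOPS / flops_per_rack_per_run
total_cost_usd = racks_needed * COST_PER_RACK_USD

print(f"racks needed: {racks_needed:,.0f}")
print(f"all-in cost:  ${total_cost_usd / 1e9:,.0f}bn")
```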

GPT-8 would require trillions

“GPT-8” is two steps of 100x in raw compute up from “GPT-6”, at 2e33 FLOPs. You’d need to use 10000x more compute than what $360B buys in 2027. Divide it by how much cheaper that compute gets within a few years, let’s say 8x cheaper. What we get is $450T, which is much more than merely “trillions”, and also technologically impossible to produce at that time without transformative AI.
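The same extrapolation, written out as a sketch. The 8x price decline is the comment's own "let's say" figure, not a forecast.

```python
# Sketch: cost of a "GPT-8" scale run, extrapolated from the "GPT-6" estimate.

GPT6_COST_USD = 360e9          # estimate for a 2e29 FLOPs system built in 2027
COMPUTE_RATIO = 100 * 100      # two GPT steps of 100x raw compute each
PRICE_DECLINE = 8              # assumed improvement in compute per dollar

gpt8_cost_usd = GPT6_COST_USD * COMPUTE_RATIO / PRICE_DECLINE

print(f"GPT-8 scale system: ${gpt8_cost_usd / 1e12:,.0f}T")
```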
1. ↩︎
  Chips in Blackwell GB200 systems are manufactured with a 4nm process and produce about 2.5e15 dense BF16 FLOP/s per chip, with each chip holding 2 almost reticle sized compute dies. Rubin moves to 3nm, compared to Blackwell at 4nm, which makes each die about 2x more performant (from more transistors and higher clock speed, but the die size must remain the same), which predicts about 2.5e15 dense BF16 FLOP/s per die or 5e15 BF16 FLOP/s per 2-die chip. (Nvidia announced that dense FP8 performance will increase 3.3x, but that’s probably due to giving more transistors to FP8, which can’t be done as much for BF16 since it already needs a lot.)
  
  To separately support this, today Google announced Ironwood, their 7th generation of TPU (that might go into production in late 2026). The announcement includes a video that shows that it’s a 2-die chip, same as non-Ultra Rubin, and it was also previously reported to be manufactured with 3nm. In today’s announcement, its performance is quoted as 4.6e15 FLOP/s, which from context of comparing with 459e12 FLOP/s of TPUv5p is likely dense BF16. This means 2.3e15 dense BF16 FLOP/s per compute die, close to my estimate for a Rubin compute die.
  
  A Kyber rack was announced to need 600 kW per rack (1.04 kW/die within-rack all-in), compared to Blackwell NVL72 at 120-130 kW per rack (0.83-0.90 kW/die within-rack all-in). Earlier non-Ultra Rubin NVL144 is a rack with the same number of chips and compute dies as Blackwell NVL72, so it might be using at most slightly higher power per compute die (let’s say 0.90 kW/die within-rack all-in). Thus the clock speed for Rubin Ultra might be up to ~1.15x higher than for non-Ultra Rubin, meaning performance of Rubin Ultra might reach 2.9e15 dense BF16 FLOP/s per die (12e15 FLOP/s per chip, 1.6e18 FLOP/s per rack).
2. ↩︎
  In a Rubin Ultra NVL576 rack, chips have 4 compute dies each, compared to only 2 dies per chip in a non-Ultra Rubin NVL144 rack. Since Nvidia sells at a large margin per compute die, and its real product is the whole system rather than the individual compute dies, it can afford to keep cutting the margin per die, while the cost of the rest of the system scales with the number of chips rather than the number of dies. The NVL576 rack has 2x more chips than the ~$5M NVL144 rack, so if the cost per chip only increases slightly, we get $12M per rack.
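The per-die power and performance figures in the first footnote can be re-derived with a small sketch. It assumes, as an upper bound in the spirit of "up to ~1.15x", that performance per die scales linearly with within-rack power per die. The 2.5e15 FLOP/s per non-Ultra Rubin die is the footnote's own estimate.

```python
# Sketch: within-rack power per die, and the implied Rubin Ultra performance.
# Assumption: per-die performance scales linearly with per-die power (upper bound).

KYBER_RACK_KW = 600
KYBER_DIES = 576                   # 144 chips x 4 dies
NVL72_RACK_KW = (120, 130)
NVL72_DIES = 144                   # 72 chips x 2 dies

kyber_kw_per_die = KYBER_RACK_KW / KYBER_DIES
nvl72_kw_per_die = tuple(kw / NVL72_DIES for kw in NVL72_RACK_KW)

RUBIN_KW_PER_DIE = 0.90            # the footnote's guess for non-Ultra Rubin
RUBIN_FLOPS_PER_DIE = 2.5e15       # dense BF16, the footnote's estimate

clock_ratio = kyber_kw_per_die / RUBIN_KW_PER_DIE
ultra_flops_per_die = RUBIN_FLOPS_PER_DIE * clock_ratio
ultra_flops_per_chip = 4 * ultra_flops_per_die
ultra_flops_per_rack = KYBER_DIES * ultra_flops_per_die

print(f"Kyber: {kyber_kw_per_die:.2f} kW/die")
print(f"NVL72: {nvl72_kw_per_die[0]:.2f}-{nvl72_kw_per_die[1]:.2f} kW/die")
print(f"clock ratio: {clock_ratio:.2f}x")
print(f"Rubin Ultra: {ultra_flops_per_die:.2e} per die, "
      f"{ultra_flops_per_chip:.2e} per chip, {ultra_flops_per_rack:.2e} per rack")
```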

Vladimir_Nesov Apr 9, 2025, 9:23 PM
4 points
10
in reply to: Noosphere89’s comment on: AI 2027: What Superintelligence Looks Like

probability mass for AI that can automate all AI research is in the 2030s … broadly due to the tariffs and …

Without AGI, scaling of hardware runs into the financial ~$200bn individual training system cost wall in 2027-2029. Any tribulations on the way (or conversely efforts to pool heterogeneous and geographically distributed compute) only delay that point slightly (when compared to the current pace of increase in funding), and you end up in approximately the same place, slowing down to the speed of advancement in FLOP/s per watt (or per dollar). Without transformative AI, anything close to the current pace is unlikely to last into the 2030s.

Vladimir_Nesov Apr 9, 2025, 4:34 PM
LW: 3 AF: 2
0
AF
in reply to: abramdemski’s comment on: abramdemski’s Shortform
With AI assistance, the degree to which an alternative is ready-to-go can differ a lot compared to its prior human-developed state. Also, an idea that’s ready-to-go is not yet an edifice of theory and software that’s ready-to-go in replacing 5e28 FLOPs transformer models, so some level of AI assistance is still necessary with 2 year timelines. (I’m not necessarily arguing that 2 year timelines are correct, but it’s the kind of assumption that my argument should survive.)

The critical period includes the time when humans are still in effective control of the AIs, or when vaguely aligned and properly incentivised AIs are in control and are actually trying to help with alignment, even if their natural development and increasing power would end up pushing them out of that state soon thereafter. During this time, the state of current research culture shapes the path-dependent outcomes. Superintelligent AIs that are reflectively stable will no longer allow path dependence in their further development, but before that happens the dynamics can be changed to an arbitrary extent, especially with AI efforts as leverage in implementing the changes in practice.

Vladimir_Nesov Apr 9, 2025, 2:13 PM
19 points
5
on: Llama Does Not Look Good 4 Anything
The most important thing about Llama 4 is that the 100K H100s run that was promised got canceled, and its flagship model Behemoth will be a 5e25 FLOPs compute optimal model^[1] rather than a ~3e26 FLOPs model that a 100K H100s training system should be able to produce. This is merely 35% more compute than Llama-3-405B from last year, while GPT-4.5, Grok 3 and Gemini 2.5 Pro are probably around 3e26 FLOPs or a bit more. They even explicitly mention that it was trained on 32K GPUs (which must be H100s). Since Behemoth is the flagship model, a bigger model got pushed back to Llama 5, which will only come out much later, possibly not even this year.

In contrast, capabilities of Maverick are unsurprising and prompt no updates. It’s merely a 2e24 FLOPs ~7x overtrained model^[2], which is 2x less compute than DeepSeek-V3 and 100x less than the recent frontier models, and also it’s not a reasoning model for now. So of course it’s not very good. If it was very good with this little compute, that would be a feat on the level of Anthropic or DeepSeek, which would be a positive update about Meta’s model training competence, but this unexpected thing merely didn’t happen, so nothing to see here, what are people even surprised about (except some benchmarking shenanigans).

To the extent Llamas 1-3 were important open weights releases that could be run by normal people locally, Llama 4 does seem disappointing, because there are no small models (in total params), though as Llama 3.2 demonstrated this might change shortly. Even the smallest Scout model still has 109B total params, meaning a 4 bit quantized version might fit on high end consumer hardware, but all the rest is only practical with datacenter hardware.
1. ↩︎
  288B active params, 30T training tokens gives 5.2e25 FLOPs by 6ND. At 1:8 sparsity (2T total params, maybe ~250B active params within experts), data efficiency is 3x lower than for a dense model, and for Llama-3-405B the compute optimal amount of data was 40 tokens per param. This means that about 120 tokens per param would be optimal for Behemoth, and in fact it has 104 tokens per active param, so it’s not overtrained.
2. ↩︎
  17B active params, 22T tokens, which is 2.25e24 FLOPs by 6ND, and 1300 tokens per active param. It’s a weird mix of dense and MoE, so the degree of its sparsity probably doesn’t map to measurements for pure MoE, but at ~1:23 sparsity (from 400B total params) it might be ~5x less data efficient than dense, predicting ~200 tokens per param compute optimal, meaning 1300 tokens per param give ~7x overtraining.
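The footnote arithmetic follows the 6ND rule of thumb (training FLOPs ≈ 6 × active params × tokens). A minimal sketch, using the data-efficiency penalties guessed in the footnotes (3x for Behemoth, ~5x for Maverick) and the 40 tokens per param figure for dense models; the helper names are illustrative.

```python
# Sketch: 6ND compute and degree of overtraining for Behemoth and Maverick.
# The efficiency penalties are the footnotes' guesses, not measured values.


def training_flops(active_params: float, tokens: float) -> float:
    """Approximate training compute by the 6ND rule."""
    return 6 * active_params * tokens


def summarize(active_params: float, tokens: float,
              efficiency_penalty: float, dense_optimal: float = 40.0) -> dict:
    """Tokens per active param compared with the estimated compute optimal ratio."""
    tokens_per_param = tokens / active_params
    optimal = dense_optimal * efficiency_penalty
    return {
        "flops": training_flops(active_params, tokens),
        "tokens_per_param": tokens_per_param,
        "optimal_tokens_per_param": optimal,
        "overtraining": tokens_per_param / optimal,
    }


behemoth = summarize(288e9, 30e12, efficiency_penalty=3)
maverick = summarize(17e9, 22e12, efficiency_penalty=5)

for name, m in (("Behemoth", behemoth), ("Maverick", maverick)):
    print(f"{name}: {m['flops']:.2e} FLOPs, "
          f"{m['tokens_per_param']:.0f} tokens/param, "
          f"{m['overtraining']:.1f}x of compute optimal")
```

An overtraining ratio below 1 for Behemoth matches "not overtrained", and about 6.5 for Maverick rounds to the ~7x quoted.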

Vladimir_Nesov Apr 9, 2025, 12:43 AM
6 points
0
in reply to: Cole Wyeth’s comment on: abramdemski’s Shortform

haven’t heard this said explicitly before

Okay, this prompted me to turn the comment into a post, maybe this point is actually new to someone.

Short Timelines Don’t Devalue Long Horizon Research

Vladimir_Nesov · Apr 9, 2025, 12:42 AM · 156 points · 20 comments · 1 min read

Vladimir_Nesov Apr 8, 2025, 7:51 PM
LW: 6 AF: 4
3
AF
in reply to: abramdemski’s comment on: abramdemski’s Shortform

prioritization depends in part on timelines

Any research rebalances the mix of currently legible research directions that could be handed off to AI-assisted alignment researchers or early autonomous AI researchers whenever they show up. Even hopelessly incomplete research agendas could still be used to prompt future capable AI to focus on them, while in the absence of such incomplete research agendas we’d need to rely on AI’s judgment more completely. So it makes sense to still prioritize things that have no hope at all of becoming practical for decades (with human effort), to make as much partial progress as possible in developing (and deconfusing) them in the next few years.

In this sense current human research, however far from practical usefulness, forms the data for alignment of the early AI-assisted or AI-driven alignment research efforts. The judgment of human alignment researchers who are currently working makes it possible to formulate more knowably useful prompts for future AIs that nudge them in the direction of actually developing practical alignment techniques.

Vladimir_Nesov Apr 8, 2025, 6:41 PM
3 points
0
in reply to: SorenJ’s comment on: AI 2027: What Superintelligence Looks Like
“Revenue by 2027.5” needs to mean “revenue between summer 2026 and summer 2027”. And the time when the $150bn is raised needs to be late 2026, not “2027.5”, in order to actually build the thing by early 2027 and have it completed for several months already by mid to late 2027 to get that 5e28 BF16 FLOPs model. Also, Nvidia would need to have been expecting this, or similar sentiment elsewhere, months to years in advance, since everyone in the supply chain can be skeptical that this kind of money will actually materialize by 2027, and therefore skeptical that they need to build additional factories in 2025-2026 to meet the hypothetical demand of 2027.

By “used for inference,” this just means basically letting people use the model?

It means using the compute to let people use various models, not necessarily this one, while the model itself might end up getting inferenced elsewhere. Numerous training experiments can also occupy a lot of GPU-time, but they will be smaller than the largest training run, and so the rest of the training system can be left to do other things. In principle some AI companies might offer cloud provider services and sell the time piecemeal on the older training systems that are no longer suited for training frontier models, but very likely they have a use for all that compute themselves.

Vladimir_Nesov Apr 8, 2025, 5:03 PM
5 points
0
in reply to: SorenJ’s comment on: AI 2027: What Superintelligence Looks Like
A 100K H100s training system is a datacenter campus that costs about $5bn to build. You can use it to train a 3e26 FLOPs model in ~3 months, and that time costs about $500M. So the “training cost” is $500M, not $5bn, but in order to do the training you need exclusive access to a giant 100K H100s datacenter campus for 3 months, which probably means you need to build it yourself, which means you still need to raise the $5bn. Outside these 3 months, it can be used for inference or training experiments, so the $5bn is not wasted. It’s just a bit suboptimal to build that much compute in a single place if your goal is primarily to serve inference around the world, because it will be quite far from most places in the world. (The 1e27 estimate is the borderline implausible upper bound, and it would take more than $500M in GPU-time to reach; 3e26 BF16 FLOPs or 6e26 FP8 FLOPs are more likely with just the Goodyear campus.)

Abilene site of Stargate is only building about 100K chips (2 buildings, ~1500 Blackwell NVL72 racks, ~250 MW, ~$6bn) by summer 2025, most of the rest of the 1.2 GW buildout happens in 2026. The 2025 system is sufficient to train a 1e27 BF16 FLOPs model (or 2e27 FP8 FLOPs).
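A back-of-the-envelope sketch of both training systems. The per-chip throughputs (about 1e15 dense BF16 FLOP/s per H100, about 2.5e15 per Blackwell chip), the 40% utilization, and the 4 month run on the Abilene system are assumptions on my part, chosen to match figures used elsewhere in this thread.

```python
# Sketch: training compute from chip count, throughput, utilization and time.
# Throughput and utilization values are assumptions, labeled below.

SECONDS_PER_MONTH = 30 * 24 * 3600
UTILIZATION = 0.40                 # assumed


def run_flops(chips: int, flops_per_chip: float, months: float) -> float:
    """Total FLOPs delivered over a training run."""
    return chips * flops_per_chip * UTILIZATION * months * SECONDS_PER_MONTH


# 100K H100s campus, 3 months (assumed ~1e15 dense BF16 FLOP/s per H100)
h100_run = run_flops(100_000, 1e15, months=3)
training_cost_share = 500e6 / 5e9          # GPU-time cost vs. build cost

# Abilene 2025 phase: ~1500 NVL72 racks with 72 chips each, 4 month run (assumed)
abilene_chips = 1500 * 72
abilene_run = run_flops(abilene_chips, 2.5e15, months=4)
abilene_cost_per_rack = 6e9 / 1500
abilene_kw_per_rack = 250_000 / 1500       # 250 MW spread over 1500 racks

print(f"100K H100s, 3 months: {h100_run:.1e} FLOPs")
print(f"training cost as share of build cost: {training_cost_share:.0%}")
print(f"Abilene 2025: {abilene_chips:,} chips, {abilene_run:.1e} FLOPs")
print(f"Abilene: ${abilene_cost_per_rack / 1e6:.0f}M and "
      f"{abilene_kw_per_rack:.0f} kW per rack, all-in")
```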

Rubin arriving 1.5 years after Blackwell means you have 1.5 years of revenue growth to use as an argument about valuation to raise money for Rubin, not 1 year. The recent round raised money for a $30bn datacenter campus, so if revenue actually keeps growing at 3x per year, then it’ll grow 5x in 1.5 years. As the current expectation is $12bn, in 1.5 years the expectation would be $60bn (counting from an arbitrary month, without sticking to calendar years). And 5x of $30bn is $150bn, anchoring to revenue growth, though actually raising this kind of absurd amount of money is a separate matter that also needs to happen.
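The growth factor is simply 3 raised to the power 1.5. A short sketch, with the rounding to 5x made explicit:

```python
# Sketch: revenue growth over 1.5 years at 3x per year, anchoring the next raise.

ANNUAL_GROWTH = 3.0
YEARS_BETWEEN_GENERATIONS = 1.5    # Blackwell to Rubin

exact_factor = ANNUAL_GROWTH ** YEARS_BETWEEN_GENERATIONS   # about 5.2
ROUNDED_FACTOR = 5                                          # as used in the text

CURRENT_REVENUE_BN = 12.0
CURRENT_CAMPUS_BN = 30.0

projected_revenue_bn = CURRENT_REVENUE_BN * ROUNDED_FACTOR
projected_campus_bn = CURRENT_CAMPUS_BN * ROUNDED_FACTOR

print(f"growth over 1.5 years: {exact_factor:.2f}x")
print(f"revenue expectation:   ${projected_revenue_bn:.0f}bn")
print(f"campus budget:         ${projected_campus_bn:.0f}bn")
```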

If miraculously Nvidia actually ships 30K Rubin racks in early 2027 (to a single customer), training will only happen a bit later, that is you’ll only have an actual 5e28 BF16 FLOPs model by mid-2027, not in 2026. Building the training system costs $150bn, but the minimum necessary cost of 3-4 months of training system’s time is only about $15bn.

More likely this only happens several months later, in 2028, and at that point there’s the better Rubin Ultra NVL576 (Kyber) coming out, so that’s a reason to avoid tying up the $150bn in capital in the inferior non-Ultra Rubin NVL144 racks and instead wait for Rubin Ultra, only expending somewhat less than $150bn on non-Ultra Rubin NVL144 in 2027, meaning only a ~2e28 BF16 FLOPs model in 2027 (and at this lower level of buildout it’s more likely to actually happen in 2027). Of course the AI 2027 timeline assumes all-encompassing capability progress in 2027, which means AI companies won’t be saving money for next year, but hardware production still needs to ramp, money won’t be able to speed it up that much on the timescale of months.

Vladimir_Nesov Apr 8, 2025, 3:34 AM
7 points
0
in reply to: Fergus Argyll’s comment on: AI 2027: What Superintelligence Looks Like
GPT-4.5 might’ve been trained on 100K H100s of the Goodyear Microsoft site ($4-5bn, same as first phase of Colossus), about 3e26 FLOPs (though there are hints in the announcement video it could’ve been trained in FP8 and on compute from more than one location, which makes up to 1e27 FLOPs possible in principle).

Abilene site of Crusoe/Stargate/OpenAI will have 1 GW of Blackwell servers in 2026, about 6K-7K racks, possibly at $4M per rack all-in, for the total of $25-30bn, which they’ve already raised money for (mostly from SoftBank). They are projecting about $12bn in revenue for 2025. If used as a single training system, it’s enough to train models for 5e27 BF16 FLOPs (or 1e28 FP8 FLOPs).

The AI 2027 timeline assumes reliable agentic models work out, so revenue continues scaling, with the baseline guess of 3x per year. If Rubin NVL144 arrives 1.5 years after Blackwell NVL72, that’s about 5x increase in expected revenue. If that somehow translates into proportional investment in datacenter construction, that might be enough to buy $150bn worth of Rubin NVL144 racks, say at $5M per rack all-in, which is 30K racks and 5 GW. Compared to Blackwell NVL72, that’s 2x more BF16 compute per rack (and 3.3x more FP8 compute). This makes the Rubin datacenter of early 2027 sufficient to train a 5e28 BF16 FLOPs model (or 1.5e29 FP8 FLOPs) later in 2027. Which is a bit more than 100x the estimate for GPT-4.5.
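The compute estimates for the two buildouts can be roughly reproduced as follows. The 2.5e15 dense BF16 FLOP/s per Blackwell chip, the 40% utilization, and the 4 month run length are assumptions carried over from other comments in this thread. Rack counts, prices, and the 2x per-rack factor come from the paragraphs above.

```python
# Sketch: training compute of the Abilene 2026 system and a $150bn Rubin system.
# Per-chip throughput, utilization and run length are assumptions.

SECONDS = 4 * 30 * 24 * 3600       # 4 months with 30-day months (assumed)
UTILIZATION = 0.40                 # assumed

BLACKWELL_RACK_FLOPS = 72 * 2.5e15               # NVL72: 72 chips per rack
RUBIN_RACK_FLOPS = 2 * BLACKWELL_RACK_FLOPS      # 2x more BF16 per rack


def run_flops(racks: float, rack_flops_per_s: float) -> float:
    """Total FLOPs delivered by a system of identical racks over one run."""
    return racks * rack_flops_per_s * UTILIZATION * SECONDS


abilene_run = run_flops(6_500, BLACKWELL_RACK_FLOPS)   # midpoint of 6K-7K racks

rubin_racks = 150e9 / 5e6                              # $150bn at $5M per rack
rubin_run = run_flops(rubin_racks, RUBIN_RACK_FLOPS)

GPT45_FLOPS = 3e26                                     # estimate quoted above
ratio_to_gpt45 = rubin_run / GPT45_FLOPS

print(f"Abilene 2026: {abilene_run:.1e} BF16 FLOPs")
print(f"Rubin 2027:   {rubin_racks:,.0f} racks, {rubin_run:.1e} BF16 FLOPs")
print(f"ratio to GPT-4.5 estimate: {ratio_to_gpt45:.0f}x")
```

With these inputs the Rubin system lands slightly under 5e28, so the quoted figure is best read as a rounded estimate.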

(I think this is borderline implausible technologically if only the AI company believes in the aggressive timeline in advance, and ramping Rubin to 30K racks for a single company will take more time. Getting 0.5-2 GW of Rubin racks by early 2027 seems more likely. Using Blackwell at that time means ~2x lower performance for the same money, undercutting the amount of compute that will be available in 2027-2028 in the absence of an intelligence explosion, but at least it’s something money will be able to buy. And of course this still hinges on the revenue actually continuing to grow, and translating into capital for the new datacenter.)

Vladimir_Nesov Apr 7, 2025, 4:08 PM
3 points
0
in reply to: Petropolitan’s comment on: Meta releases Llama-4 herd of models
Your point is one of the clues I mentioned that I don’t see as comparably strong to the May 2023 paper, when it comes to prediction of loss/perplexity. The framing in your argument appeals to things other than the low-level metric of loss, so I opened my reply with focusing on it rather than the more nebulous things that are actually important in practice. Scaling laws work with loss the best (holding across many OOMs of compute), and repeating 3x rather than 7x (where loss first starts noticeably degrading) gives some margin of error. That is, a theoretical argument along the lines of what you are saying shifts my expectation for 10x-20x repetition (which might degrade faster when working with lower quality data), but not yet for 3x repetition (which I still expect to get an ~unchanged loss).

Also, check out https://www.reddit.com/r/LocalLLaMA, they are very disappointed how bad the released models turned out to be (yeah I know that’s not directly indicative of Behemoth performance)

So far I haven’t even seen anyone there notice that Behemoth means that Llama 4 was essentially canceled and instead we got some sort of Llama 3.5 MoE. That is, a 100K+ H100s training run that was the expected and announced crown jewel of Llama 4 won’t be coming out, probably until at least late 2025 and possibly even 2026. Since Behemoth is the flagship model for Llama 4, a 3e26+ FLOPs model that would’ve been appropriate for a 100K H100s training system instead got pushed back to Llama 5.

As Behemoth is only a 5e25 FLOPs model, even once it comes out it won’t be competing in the same weight class as GPT-4.5, Grok 3, and Gemini 2.5 Pro. Maverick is only a 2e24 FLOPs^[1] model (2x less than DeepSeek-V3, ~100x less than the recent frontier models), so of course it’s not very good compared to the frontier models. Since Meta didn’t so far demonstrate competence on the level of DeepSeek or Anthropic, they do need the big compute to remain in the game, and Maverick is certainly not big compute.

(LocalLLaMA specifically is annoyed by absence of models with a small number of total parameters in the current Llama 4 announcement, which means you need high end consumer hardware to run even Scout in 4 bit quantization locally, and datacenter hardware for the rest.)
1. ↩︎
  It’s a 17B active parameter model trained for 22T tokens.

Vladimir_Nesov Apr 7, 2025, 3:30 PM
2 points
0
in reply to: Roman Leventov’s comment on: An Optimistic 2027 Timeline
I meant “reliable agents” in the AI 2027 sense, that is something on the order of being sufficient for automated AI research, leading to much more revenue and investment in the lead-up rather than stalling at ~$100bn per individual training system for multiple years. My point is that it’s not currently knowable if it happens imminently in 2026-2027 or at least a few years later, meaning I don’t expect that evidence exists that distinguishes these possibilities even within the leading AI companies.

Vladimir_Nesov Apr 7, 2025, 3:10 PM
4 points
0
in reply to: Paragox’s comment on: An Optimistic 2027 Timeline
The reason Rubin NVL576 probably won’t help as much as the current transition from Hopper is that Blackwell NVL72 is already ~sufficient for the model sizes that are compute optimal to train on $30bn Blackwell training systems (which Rubin NVL144 training systems probably won’t significantly leapfrog before Rubin NVL576 comes out, unless there are reliable agents in 2026-2027 and funding goes through the roof).

when we get 576 (194 gpus)

The terminology Huang was advocating for at GTC 2025 (at 1:28:04) is to use “GPU” to refer to compute dies rather than chips/packages, and in these terms a Rubin NVL576 rack has 144 chips and 576 GPUs, rather than 144 GPUs. Even though this seems contentious, the terms compute die and chip/package remain less ambiguous than “GPU”.
