Vladimir_Nesov

Karma: 32,813

Vladimir_Nesov Apr 16, 2025, 11:44 PM
8 points
2
in reply to: lc’s comment on: lc’s Shortform
My first impression of o3 (as available via Chatbot Arena) is that when I’m showing it my AI scaling analysis comments (such as this and this), it responds with confident unhinged speculation teeming with hallucinations, compared to the other recent models that usually respond with bland rephrasings that get almost everything correctly with a few minor hallucinations or reasonable misconceptions carrying over from their outdated knowledge.

Don’t know yet if it’s specific to speculative/forecasting discussions, but it doesn’t look good (for faithfulness of arguments) when combined with good performance on benchmarks. Possibly stream of consciousness style data is useful to write down within long reasoning traces and can add up to normality for questions with a short final answer, but results in spurious details within confabulated summarized arguments for that answer (outside the hidden reasoning trace) that aren’t measured by hallucination benchmarks and so allowed to get worse. Though in the o3 System Card hallucination rate also significantly increased compared to o1 (Section 3.3).

Vladimir_Nesov Apr 16, 2025, 9:37 PM
17 points
0
on: GPT-4.1 Is a Mini Upgrade

Will Brown: it’s simple, really. GPT-4.1 is o3 without reasoning … o1 is 4o with reasoning … and o4 is GPT-4.5 with reasoning.

Price and knowledge cutoff for o3 strongly suggest it’s indeed GPT-4.1 with reasoning. And so again we don’t get to see the touted scaling of reasoning models, since the base model got upgraded instead of remaining unchanged. (I’m getting the impression that GPT-4.5 with reasoning is going to be called “GPT-5” rather than “o4″, similarly to how Gemini 2.5 Pro is plausibly Gemini 2.0 Pro with reasoning.)

In any case, the fact that o3 is not GPT-4.5 with reasoning means that there is still no word on what GPT-4.5 with reasoning is capable of. For Anthropic, Sonnet 3.7 with reasoning is analogous to o1 (it’s built on the base model of the older Sonnet 3.5, similarly to how o1 is built on the base model of GPT-4o). Internally, they probably already have a reasoning model for some larger Opus model (analogous to GPT-4.5) and for a newer Sonnet (analogous to GPT-4.1) with a newer base model different from that of Sonnet 3.5.

This also makes it less plausible that Gemini 2.5 Pro is based on a GPT-4.5 scale model (even though TPUs might’ve been able to make its price/speed possible even if it was), so there might be a Gemini 2.0 Ultra internally after all, at least as a base model. One of the new algorithmic secrets disclosed in Gemma 3 report was that pretraining knowledge distillation works even when the teacher model is much larger (rather than modestly larger) than the student model, it just needs to be trained for enough tokens for this to become an advantage rather than a disadvantage (Figure 8), something that for example Llama 3.2 from Sep 2024 still wasn’t taking advantage of. This makes it useful to train the largest possible compute optimal base model regardless of whether its better quality justifies its inference cost, merely to make the smaller overtrained base models better by pretraining them from the large model logits with knowledge distillation instead of from raw tokens.
What links here?
- Vladimir_Nesov's comment on Shortform by lc (Apr 16, 2025, 11:44 PM; 8 points)

Vladimir_Nesov Apr 15, 2025, 7:39 PM
10 points
0
in reply to: Kaj_Sotala’s comment on: Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI

To me these kinds of failures feel more “seem to be at the core of the way LLMs reason”.

Right, I was more pointing out that if the analogy holds to some extent, then long reasoning training is crucial as the only locus of feedback (and also probably insufficient in current quantities relative to pretraining). The analogy I intended is this being a perception issue that can be worked around without too much fundamental difficulty, but only with sufficient intentional caution. Humans have the benefit of lifelong feedback and optimization by evolution, so LLMs with no feedback whatsoever might suffer much more from a similar issue, and the severity of its impact doesn’t strongly argue its different nature.

I was specifically talking about whether scaling just base models seems enough to solve the issue

To the extent long reasoning training might elicit relevant things, base model scaling shouldn’t be evaluated without it. Some capabilities that are made available by scaling the base model will only become visible after long reasoning training elicits them.
What links here?
- Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI by Kaj_Sotala (Apr 15, 2025, 3:56 PM; 162 points)

Vladimir_Nesov Apr 15, 2025, 6:18 PM
24 points
4
on: Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI

the fact that e.g. GPT-4.5 was disappointing

It’s not a reasoning variant though, the only credible reasoning model at the frontier ~100K H100s scale that’s currently available is Gemini 2.5 Pro (Grok 3 seems to have poor post-training, and is suspiciously cheap/fast without Blackwell or presumably TPUs, so likely rather overtrained). Sonnet 3.7 is a very good GPT-4 scale reasoning model, and the rest are either worse or trained for even less compute or both. These weird failures might be analogous to optical illusions (but they are textual, not known to human culture, and therefore baffling), in which case working around them needs some form of feedback in the training process, and currently only long reasoning training plausibly offers relevant feedback.

Reasoning traces themselves haven’t yet been scaled beyond 30K-70K tokens, benchmarks show strong improvement with reasoning trace length and it should work until at least ~1M (since non-reasoning models are sometimes able to handle 100K-200K input tokens with an OK quality). This is 20x more, the same as the distance between 2K and 40K reasoning tokens. And Blackwell (NVL72/NVL36, not DGX/HGX B200) that will enable this or use of larger models for AI companies that are not Google is still getting installed.

So my crux for the observations of this post is whether it ages well enough to survive late 2025 (though 1M token reasoning traces might take longer, maybe first half of 2026, at which point we’ll also start seeing models from ~100K Blackwell chip scale training runs).

Vladimir_Nesov Apr 15, 2025, 4:48 PM
5 points
0
in reply to: Randaly’s comment on: Comments on “AI 2027”
I see what you mean (I did mostly change the topic to the slowdown hypothetical). There is another strange thing about AI companies, I think giving ~50% in cost of inference too much precision in the foreseeable future is wrong, as it’s highly uncertain and malleable in a way that’s hard for even the company itself to anticipate.

About ~2x difference in inference cost (or size of a model) can be merely hard to notice when nothing substantial changes in the training recipe (and training cost), and better post-training (which is relatively cheap) can get that kind of advantage or more, but not reliably. Pretraining knowledge distillation might get another ~1.5x at the cost of training a larger teacher model (plausibly GPT-4.1 has this because of the base model for GPT-4.5, but GPT-4o doesn’t). And there are all the other compute multipliers that become less fake if the scale stops advancing. The company itself won’t be able to plan with any degree of certainty how good its near future models will be relative to their cost, or how much its competitors will be able to cut prices. So the current state of cost of inference doesn’t seem like a good anchor for where it might settle in the slowdown timelines.

Vladimir_Nesov Apr 15, 2025, 1:43 AM
6 points
3
in reply to: AnthonyC’s comment on: Reactions to METR task length paper are insane

We use reasoning models with more inference time compute to generate better data to train better base models to more efficiently reproduce similar capability levels with less compute to build better reasoning models.

This kind of thing isn’t known to meaningfully work, as something that can potentially be done on pretraining scale. It also doesn’t seem plausible without additional breakthroughs given the nature and size of verifiable task datasets, with things like o3-mini getting ~matched on benchmarks by post-training on datasets containing 15K-120K problems. All the straight lines for reasoning models so far are only about scaling a little bit, using scarce resources that might run out (verifiable problems that help) and untried-at-more-scale algorithms that might break down (in a way that’s hard to fix). So the known benefit is still plausible to remain a one-time improvement, extending it significantly (into becoming a new direction of scaling) hasn’t been demonstrated.

I think even remaining as a one-time improvement, long reasoning training might still be sufficient to get AI takeoff within a few years just from pretraining scaling of the underlying base models, but that’s not the same as already believing that RL post-training actually scales very far by itself. Most plausibly it does scale with more reasoning tokens in a trace, getting from the current ~50K to ~1M, but that’s separate from scaling with RL training all the way to pretraining scale (and possibly further).

Vladimir_Nesov Apr 15, 2025, 12:12 AM
4 points
0
in reply to: Randaly’s comment on: Comments on “AI 2027”

OpenAI continuing to lose money

They are losing money only if you include all the R&D (where the unusual thing is very expensive training compute for experiments), which is only important while capabilities keep improving. If/when the capabilities stop improving quickly, somewhat cutting research spending won’t affect their standing in the market that much. And also after revenue grows some more, essential research (in the slow capability growth mode) will consume a smaller fraction. So it doesn’t seem like they are centrally “losing money”, the plausible scenarios still end in profitability (where they don’t end the world) if they don’t lose the market for normal reasons like failing on products or company culture.

OpenAI cannot raise exponentially more money without turning a profit, which it cannot do

This does seem plausible in some no-slowdown worlds (where they ~can’t reduce R&D spending in order to start turning profit), if in fact more investors don’t turn up there. On the other hand, if every AI company is forced to reduce R&D spending because they can’t raise money to cover it, then they won’t be outcompeted by a company that keeps R&D spending flowing, because such a competitor won’t exist.

Vladimir_Nesov Apr 14, 2025, 2:36 PM
8 points
0
in reply to: Knight Lee’s comment on: Commitment Races are a technical problem ASI can easily solve

in real life no intelligent being … can convert themselves into a rock

if they become a rock … the other players will not know it

Refusing in the ultimatum game punishes the prior decision to be unfair, not what remains after the decision is made. It doesn’t matter if what remains is capable of making further decisions, the negotiations backed by ability to refuse an unfair offer are not with them, but with the prior decision maker that created them.

If you convert yourself into a rock (or a utility monster), it’s the decision to convert yourself that’s the opponent of refusal to accept the rock’s offer, the rock is not the refusal’s opponent, even as the refusal is being performed against a literal rock. Predictions about the other players turn anti-inductive when they get exploited, exploiting a prediction about behavior too much makes it increasingly incorrect, since the behavior adapts in response to exploitation starting to show up in the prior. If most rocks that enter the ultimatum game are remains of former unfair decision makers with the rocks’ origins perfectly concealed (as a ploy to make it so that the other player won’t suspect anything and so won’t refuse), then this general fact makes the other player suspect all rocks and punish their possible origins, destroying the premise of not-knowing necessary for the strategy of turning yourself into a rock to shield the prior unfair decision makers from negotiations.

Vladimir_Nesov Apr 13, 2025, 8:03 PM
14 points
4
in reply to: Cole Wyeth’s comment on: Cole Wyeth’s Shortform
LW doesn’t punish, it upvotes-if-interesting and then silently judges.

confidence / effort ratio

(Effort is not a measure of value, it’s a measure of cost.)

Vladimir_Nesov Apr 13, 2025, 7:42 PM
7 points
0
in reply to: Knight Lee’s comment on: Commitment Races are a technical problem ASI can easily solve

The other side is forced to agree to that, just to get a little.

That’s not how the ultimatum game works in non-CDT settings, you can still punish the opponent for offering too little, even at the cost of getting nothing in the current possible world (thereby reducing its weight and with it the expected cost). In this case it deters commitment racing.

Vladimir_Nesov Apr 13, 2025, 3:19 PM
3 points
0
in reply to: Alexander Gietelink Oldenziel’s comment on: Alexander Gietelink Oldenziel’s Shortform
The term is a bit conflationary. Persuasion for the masses is clearly a thing, its power is coordination of many people and turning their efforts to (in particular) enforce and propagate the persuasion (this works even for norms that have no specific persuader that originates them, and contingent norms that are not convergently generated by human nature). Individual persuasion with a stronger effect that can defeat specific people is probably either unreliable like cults or conmen (where many people are much less susceptible than some, and objective deception is necessary), or takes the form of avoidable dangers like psychoactive drugs: if you are not allowed to avoid exposure, then you have a separate problem that’s arguably more severe.

With AI, it’s plausible that coordinated persuasion of many people can be a thing, as well as it being difficult in practice for most people to avoid exposure. So if AI can achieve individual persuasion that’s a bit more reliable and has a bit stronger effect than that of the most effective human practitioners who are the ideal fit for persuading the specific target, it can then apply it to many people individually, in a way that’s hard to avoid in practice, which might simultaneously get the multiplier of coordinated persuasion by affecting a significant fraction of all humans in the communities/subcultures it targets.

Vladimir_Nesov Apr 12, 2025, 3:12 AM
6 points
0
in reply to: Remmelt’s comment on: Crash scenario 1: Rapidly mobilise for a 2025 AI crash

the impact of new Blackwell chips with improved computation

It’s about world size, not computation, and has a startling effect that probably won’t occur again with future chips, since Blackwell sufficiently catches up to models at the current scale.

But even then, OpenAI might get to ~$25bn annualized revenue that won’t be going away

What is this revenue estimate assuming?

The projection for 2025 is $12bn at 3x/year growth (1.1x per month, so $1.7bn per month at the end of 2025, $3bn per month in mid-2026), and my pessimistic timeline above assumes that this continues up to either end of 2025 or mid-2026 and then stops growing after the hypothetical “crash”, which gives $20-36bn per year.

Vladimir_Nesov Apr 11, 2025, 11:12 PM
2 points
0
in reply to: Tapatakt’s comment on: Weird Random Newcomb Problem
Not knowing n(-) results in not knowing expected utility of b (for any given b), because you won’t know how the terms a(n(a), n(a)) are formed.

(And also the whole being given numeric codes of programs as arguments thing gets weird when you are postulated to be unable to interpret what the codes mean. The point of Newcomblike problems is that you get to reason about behavior of specific agents.)

Vladimir_Nesov Apr 11, 2025, 9:37 PM
6 points
2
on: Comments on “AI 2027”

I can’t think of any reason to give a confident, high precision story that you don’t even believe in!

Datapoints generalize, a high precision story holds gears that can be reused in other hypotheticals. I’m not sure what you mean by the story being presented as “confident” (in some sense it’s always wrong to say that a point prediction is “confident” rather than zero probability, even if it’s the mode of a distribution, the most probable point). But in any case I think giving high precision stories is a good methodology for communicating a framing, pointing out which considerations seem to be more important in thinking about possibilities, and also which events (that happen to occur in the story) seem more plausible than their alternatives.

Vladimir_Nesov Apr 11, 2025, 8:48 PM
3 points
0
on: Weird Random Newcomb Problem

Question 1: Assume you are program b. You want to maximize the money you receive. What should you output if your input is (x,x) (i.e., the two numbers are equal)?

Question 2: Assume you are the programmer writing program b. You want to maximize the expected money program b receives. How should you design b to behave when it receives an input (x,x)?

Do you mean to ask how b should behave on input (n(b), n(b)), and how b should be written to behave on input (n(b), n(b)) for that b?

If x differs from n(b), it might matter in some subtle ways but not straightforwardly how b behaves on (x, x), because that never occurs explicitly in the actual thought experiment (where the first argument is always the code for the program itself). And if the programmer knows x before writing b, and x must be equal to n(b), then since n(-) is bijective, they don’t have any choice about how to write b other than to be the preimage of x under n(-).

Vladimir_Nesov Apr 11, 2025, 4:33 PM
14 points
2
in reply to: Thane Ruthenis’s comment on: On Google’s Safety Plan
Official policy documents from AI companies can be useful in bringing certain considerations into the domain of what is allowed to be taken seriously (in particular, by the governments), as opposed to remaining weird sci-fi ideas to be ignored by most Serious People. Even declarations by AI company leaders or Turing award winners of Nobel laureates or some of the most cited AI scientists won’t by themselves have that kind of legitimizing effect. So it’s not necessary for such documents to be able to directly affect actual policies of AI companies, they can still be important in affecting these policies indirectly.

Vladimir_Nesov Apr 11, 2025, 3:05 PM
11 points
4
on: Crash scenario 1: Rapidly mobilise for a 2025 AI crash
I think it’s overdetermined by Blackwell NVL72/NVL36 and long reasoning training that there will be no AI-specific “crash” until at least late 2026. Reasoning models want a lot of tokens, but their current use is constrained by cost and speed, and these issues will be going away to a significant extent. Already Google has Gemini 2.5 Pro (taking advantage of TPUs), and within a few months OpenAI and Anthropic will make reasoning variants of their largest models practical to use as well (those pretrained at the scale of 100K H100s / ~3e26 FLOPs, meaning GPT-4.5 for OpenAI).

The same practical limitations (as well as novelty of the technique) mean that long reasoning models aren’t using as many reasoning tokens as they could in principle, everyone is still at the stage of getting long reasoning traces to work at all vs. not yet, rather than scaling things like the context length they can effectively use (in products rather than only internal research). It’s plausible that contexts with millions of reasoning tokens can be put to good use, where other training methods failed to make contexts at that scale work well.

So later in 2025 there’s better speed and cost, driving demand in terms of the number of prompts/requests, and for early to mid-2026 potentially longer reasoning traces, driving demand in terms of token count. After that, it depends on whether capabilities get much better than Gemini 2.5 Pro. Pretraining scale in deployed models will only advance 2x-5x by mid-2026 compared to now (using 100K-200K Blackwell chip training systems built in 2025), which is not a large enough change to be very noticeable, so it’s not by itself sufficient to prevent a return of late 2024 vaguely pessimistic sentiment, and other considerations might get more sway with funding outcomes. But even then, OpenAI might get to ~$25bn annualized revenue that won’t be going away, and in 2027 or slightly earlier there will be models pretrained for ~4e27 FLOPs using the training systems built in 2025-2026 (400K-600K Blackwell chips, 0.8-1.4 GW, $22-35bn), which as a 10x-15x change (compared to the models currently or soon-to-be deployed in 2025) is significant enough to get noticeably better across the board, even if nothing substantially game-changing gets unlocked. So the “crash” might be about revenue no longer growing 3x per year, and so the next generation training systems built in 2027-2028 not getting to the $150bn scale they otherwise might’ve aspired to.
What links here?
- Vladimir_Nesov's comment on Shortform by lc (Apr 16, 2025, 11:44 PM; 8 points)
- Vladimir_Nesov's comment on Comments on “AI 2027” by Randaly (Apr 23, 2025, 4:58 AM; 4 points)

Vladimir_Nesov Apr 10, 2025, 3:11 PM
5 points
2
in reply to: Benjamin_Todd’s comment on: The case for AGI by 2030
I think the idea of effective FLOPs has more narrow applicability than what you are running with, many things that count as compute multipliers don’t scale. They often only hold for particular capabilities that stop being worth boosting separately at greater levels of scale, or particular data that stops being available in sufficient quantity. An example of a scalable compute multiplier is MoE (even as it destroys data efficiency, and so damages some compute multipliers that rely on selection of high quality data). See Figure 4 in the Mamba paper for another example of a scalable compute multiplier (between GPT-3 transformer and Llama 2 transformer, Transformer and Transformer++ respectively in the figure). This issue is particularly serious when we extrapolate by many OOMs, and I think only very modest compute multipliers (like 1.5x/year) survive across all that, because most things that locally seem like compute multipliers don’t compound very far.

There are also issues I have with Epoch studies on this topic, mainly extrapolating scaling laws from things that are not scalable much at all and weren’t optimized for compute optimality, with trends being defined by limits of scalability and idiosyncratic choices of hyperparameters driven by other concerns, rather than a well-defined and consistent notion of compute optimality (which I’d argue wasn’t properly appreciated and optimized-for by the field until very recently, with Chinchilla still managing to fix glaring errors merely in 2022). Even now, papers arguing compute multipliers keep showing in-training loss/perplexity plots that weren’t cooled down before measurement, I think Figure 11 from the recent OLMo 2 paper illustrates this point brilliantly, showing how what looks like a large compute multiplier^[1] before the learning rate schedule runs its course can lose all effect once it does.
1. ↩︎
  To be fair it’s kind of a toy case where the apparent “compute multiplier” couldn’t seriously be expected to be real, but it does illustrate the issue with the methodology of looking at loss/perplexity plots where the learning rate is still high, or might differ significantly between points being compared on the plots for different architecture variants, hopelessly confounding the calculation of a compute multiplier.

Vladimir_Nesov Apr 9, 2025, 10:53 PM
10 points
0
on: The case for AGI by 2030
spending tens of billions of dollars to build clusters that could train a GPT-6-sized model in 2028

Traditionally steps of GPT series are roughly 100x in raw compute (I’m not counting effective compute, since it’s not relevant to cost of training). GPT-4 is 2e25 FLOPs. Which puts “GPT-6” at 2e29 FLOPs. To train a model in 2028, you would build an Nvidia Rubin Ultra NVL576 (Kyber) training system in 2027. Each rack holds 576 compute dies at about 3e15 BF16 FLOP/s per die^[1] or 1.6e18 FLOP/s per rack. A Blackwell NVL72 datacenter costs about $4M per rack to build, possibly a non-Ultra Rubin NVL144 datacenter will cost about $5M per rack, and a Rubin Ultra NVL576 datacenter might cost about $12M per rack^[2].

To get 2e29 BF16 FLOPs in 4 months at 40% utilization, you’d need 30K racks that would cost about $360B all-in (together with the rest of the training system). Which is significantly more than “tens of billions of dollars”.

GPT-8 would require trillions

“GPT-8” is two steps of 100x in raw compute up from “GPT-6″, at 2e33 FLOPs. You’d need to use 10000x more compute than what $360B buy in 2027. Divide it by how much cheaper that compute gets within a few years, let’s say 8x cheaper. What we get is $450T, which is much more than merely “trillions”, and also technologically impossible to produce at that time without transformative AI.
1. ↩︎
  Chips in Blackwell GB200 systems are manufactured with 4nm process and produce about 2.5 dense BF16 FLOP/s per chip, with each chip holding 2 almost reticle sized compute dies. Rubin moves to 3nm, compared to Blackwell at 4nm, which makes each die about 2x more performant (from more transistors and higher clock speed, but the die size must remain the same), which predicts about 2.5 dense BF16 FLOP/s per die or 5 BF16 FLOP/s per 2-die chip. (Nvidia announced that dense FP8 performance will increase 3.3x, but that’s probably due to giving more transistors to FP8, which can’t be done as much for BF16 since it already needs a lot.)
  
  To separately support this, today Google announced Ironwood, their 7th generation of TPU (that might go into production in late 2026). The announcement includes a video that shows that it’s a 2-die chip, same as non-Ultra Rubin, and it was also previously reported to be manufactured with 3nm. In today’s announcement, its performance is quoted as 4.6e15 FLOP/s, which from context of comparing with 459e12 FLOP/s of TPUv5p is likely dense BF16. This means 2.3e15 dense BF16 FLOP/s per compute die, close to my estimate for a Rubin compute die.
  
  A Kyber rack was announced to need 600 KW per rack (1.04 KW/die within-rack all-in), compared to Blackwell NVL72 at 120-130 KW per rack (0.83-0.90 KW/die within-rack all-in). Earlier non-Ultra Rubin NVL144 is a rack with the same number of chips and compute dies as Blackwell NVL72, so it might be using at most slightly higher power per compute die (let’s say 0.90 KW/die within-rack all-in). Thus the clock speed for Rubin Ultra might be up to ~1.15x higher than for non-Ultra Rubin, meaning performance of Rubin Ultra might reach 2.9e15 dense BF16 FLOP/s per die (12e15 FLOP/s per chip, 1.6e18 FLOP/s per rack).
2. ↩︎
  In a Rubin Ultra NVL576 rack, chips have 4 compute dies each, compared to only 2 dies per chip in a non-Ultra Rubin NVL144 rack. Since Nvidia sells at a large margin per compute die, and its real product is the whole system rather than the individual compute dies, it can afford to keep cutting the margin per die, while the cost of the rest of the system scales with the number of chips rather then the number of dies. The NVL576 rack has 2x more chips than the ~$5M NVL144 rack, so if the cost per chip only increases slightly, we get $12M per rack.

Vladimir_Nesov Apr 9, 2025, 9:23 PM
4 points
10
in reply to: Noosphere89’s comment on: AI 2027: What Superintelligence Looks Like

probability mass for AI that can automate all AI research is in the 2030s … broadly due to the tariffs and …

Without AGI, scaling of hardware runs into the financial ~$200bn individual training system cost wall in 2027-2029. Any tribulations on the way (or conversely efforts to pool heterogeneous and geographically distributed compute) only delay that point slightly (when compared to the current pace of increase in funding), and you end up in approximately the same place, slowing down to the speed of advancement in FLOP/s per watt (or per dollar). Without transformative AI, anything close to the current pace is unlikely to last into the 2030s.