The fact that RL now seems to work well on LLMs without special tricks, as reported by many replications of R1, suggests to me that AGI is indeed not far off.
Still, at least until base model effective training compute is scaled another 1,000x (which doesn't happen until 2028-2029), this kind of RL training probably won't generalize far enough without neural (LLM-based) rewards, and for now those don't let RL scale as far as explicitly coded verifiers do.
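For the 2028-2029 figure, the implied arithmetic is just compound growth; here's a minimal sketch assuming frontier base model effective training compute grows roughly 4-5x per year (the growth rate is my assumption, not something stated above):

```python
import math

# How long it takes to scale effective training compute another 1,000x,
# assuming (my assumption) frontier base-model compute grows ~4-5x per year.
for annual_growth in (4, 5):
    years = math.log(1000) / math.log(annual_growth)
    print(f"{annual_growth}x/year -> 1,000x in ~{years:.1f} years")
# ~5.0 years at 4x/year, ~4.3 years at 5x/year:
# starting from 2024-level base models, that lands around 2028-2029.
```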
Not to convergence: the graphs in the paper keep going up. Carrying the analogy over, that might explain some of the change from o1 to o3 (the graphs in the o1 post also keep going up), though new graders coded for additional verifiable problems are no doubt a large part of it as well.
It seems like o1-mini is its own thing; it might even start from a base model unrelated to GPT-4o-mini (perhaps with its own specialized pretraining data mix). So a clue about o3-mini data doesn't obviously transfer to o3.
The numbering in the GPT-N series advances by roughly 100x in raw compute per generation. If the original GPT-4 is 2e25 FLOPs, then a GPT-5 would need 2e27 FLOPs, and a 100K-H100 training system (like the Microsoft/OpenAI system at the site near the Goodyear airport) can only get you 3e26 FLOPs or so (in BF16, over 3 months). The initial Stargate training system at the Abilene site, once it gets 300K B200s, will be about 7x stronger than that, so it will be able to reach 2e27 FLOPs. Thus I expect GPT-5 in 2026 if OpenAI keeps following the naming convention, while the new 100K-H100 model this year will be GPT-4.5o or something like that.
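As a rough sketch of that arithmetic (the per-chip peak throughputs and the ~40% utilization figure are my assumptions, not from the posts I'm referencing):

```python
# Back-of-the-envelope FLOPs for the two training systems.
# Assumptions (mine): H100 dense BF16 peak ~1e15 FLOP/s, B200 ~2.25e15 FLOP/s,
# ~40% utilization, a ~3-month (90-day) training run.

SECONDS_IN_3_MONTHS = 90 * 24 * 3600   # ~7.8e6 s
UTILIZATION = 0.4                       # assumed compute utilization

h100_peak = 1.0e15                      # dense BF16 FLOP/s per H100 (approx.)
b200_peak = 2.25e15                     # dense BF16 FLOP/s per B200 (approx.)

h100_cluster = 100_000 * h100_peak * UTILIZATION * SECONDS_IN_3_MONTHS
b200_cluster = 300_000 * b200_peak * UTILIZATION * SECONDS_IN_3_MONTHS

print(f"100K H100s, 3 months: {h100_cluster:.1e} FLOPs")  # ~3e26
print(f"300K B200s, 3 months: {b200_cluster:.1e} FLOPs")  # ~2e27
print(f"ratio: {b200_cluster / h100_cluster:.1f}x")        # ~6.8x, i.e. roughly 7x
```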