Noam Brown: “These aren’t the raw CoTs but it’s a big step closer.”
Thane Ruthenis
I think that’s a pretty plausible way the world could be, yes.
I still expect the Singularity somewhere in the 2030s, even under that model.
GPT-5 ought to come around September 20, 2028, but Altman said today it’ll be out within months
I don’t think what he said meant what you think it meant. Exact words:
In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3
The “GPT-5” he’s talking about is not the next generation of GPT-4, not an even bigger pretrained LLM. It is some wrapper over GPT-4.5/Orion, their reasoning models, and their agent models. My interpretation is that “GPT-5” the product and GPT-5 the hypothetical 100x-bigger GPT model are two completely different things.
I assume they’re doing these naming shenanigans specifically to confuse people and create the illusion of continued rapid progress. This “GPT-5” is probably going to look pretty impressive, especially to people in the free tier, who’ve only been familiar with e. g. GPT-4o so far.
Anyway, I think the actual reason the proper GPT-5 – that is, an LLM 100+ times as big as GPT-4 – isn’t out yet is that the datacenters powerful enough to train it are only just now coming online. It’ll probably be produced in 2026.
followed by the promised ‘GPT-5’ within months that Altman says is smarter than he is
Ha, good catch. I don’t think GPT-5, the future pretrained LLM 100x the size of GPT-4 which Altman says will be smarter than he is, and GPT-5, the Frankensteinian monstrosity into which OpenAI plan to stitch all of their current offerings, are the same entity.
Edit: It occurred to me that this reading may not be obvious. Look at Altman’s Exact Words:
In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3. We will no longer ship o3 as a standalone model.
He is saying, quite clearly, that they are going to create a wrapper over {Orion, o3, search, Deep Research, Operator, etc.} that would turn those disparate tools into a “magic unified intelligence”, and then they will arbitrarily slap the “GPT-5” label on that thing. My interpretation is that the “GPT-5” being talked about is not the next model in the GPT-n series.
Why are they doing that? My guess is, to sow confusion and raise hype. GPT-5 the Frankenstein’s Monster is probably going to look/feel pretty impressive, capable of doing all of these disparate things. In particular, it is going to be shockingly impressive to normies at the free tier, if the “baseline” level of its intelligence gives them a taste of the previously-paid offerings.
All very important if you want to maintain the hype momentum.
Technology for efficient human uploading. Ideally backed by theory we can independently verify as correct and doing what it’s intended to do (rather than e. g. replacing the human upload with a copy of the AGI who developed this technology).
What are the current best theories regarding why Altman is doing the for-profit transition at all?
I don’t buy the getting-rid-of-profit-caps motive. It seems to live in a fantasy universe in which they expect to have a controllable ASI on their hands, yet… still be bound by economic rules and regulations? If OpenAI controls an ASI, OpenAI’s leadership would be able to unilaterally decide where the resources go, regardless of what various contracts and laws say. If the profit caps are there but Altman wants to reward loyal investors, all profits will go to his cronies. If the profit caps are gone but Altman is feeling altruistic, he’ll throw the investors a modest fraction of the gains and distribute the rest however he sees fit. The legal structure doesn’t matter; what matters is who physically types what commands into the ASI control terminal.
Maybe it makes sense in a soft-takeoff world where several actors get ASI tools literally simultaneously, and then some of them choose to use these tools to enforce pre-Singularity agreements (but not prevent other ASIs from being created?), thereby binding OpenAI to obey their contracts...? Sounds incredibly dubious.
Or perhaps it doesn’t need to make sense. Perhaps the point is to make it look like investing in OpenAI big now can make you a god-king alongside Altman, thereby attracting massive funding for OpenAI in the now, with the investors being too stupid/FOMO’d to realize these agreements won’t matter?
I do think RL works “as intended” to some extent, teaching models some actual reasoning skills, much like SSL works “as intended” to some extent, chiseling-in some generalizable knowledge. The question is to what extent it’s one or the other.
Some more evidence that whatever the AI progress on benchmarks is measuring, it’s likely not measuring what you think it’s measuring:
AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination
AIME 2025 part I was conducted yesterday, and the scores of some language models are available here:
https://matharena.ai thanks to @mbalunovic, @ni_jovanovic et al.
I have to say I was impressed, as I predicted the smaller distilled models would crash and burn, but they actually scored at a reasonable 25-50%.
That was surprising to me! Since these are new problems, not seen during training, right? I expected smaller models to barely score above 0%. It’s really hard to believe that a 1.5B model can solve pre-math-olympiad problems when it can’t multiply 3-digit numbers. I was wrong, I guess.
I then used openai’s Deep Research to see if similar problems to those in AIME 2025 exist on the internet. And guess what? An identical problem to Q1 of AIME 2025 exists on Quora:
https://quora.com/In-what-bases-b-does-b-7-divide-into-9b-7-without-any-remainder
I thought maybe it was just coincidence, and used Deep Research again on Problem 3. And guess what? A very similar question was on math.stackexchange:
https://math.stackexchange.com/questions/3548821/
Still skeptical, I used Deep Research on Problem 5, and a near identical problem appears again on math.stackexchange:
I haven’t checked beyond that because the freaking p-value is too low already. Problems near identical to the test set can be found online.
So, what—if anything—does this imply for Math benchmarks? And what does it imply for all the sudden hill climbing due to RL?
I’m not certain, and there is a reasonable argument that even if something in the train-set contains near-identical but not exact copies of test data, it’s still generalization. I am sympathetic to that. But, I also wouldn’t rule out that GRPO is amazing at sharpening memories along with math skills.
At the very least, the above show that data decontamination is hard.
Never ever underestimate the amount of stuff you can find online. Practically everything exists online.
I think one of the other problems with benchmarks is that they necessarily select for formulaic/uninteresting problems that we fundamentally know how to solve. If a mathematician figured out something genuinely novel and important, it wouldn’t go into a benchmark (even if it were initially intended for a benchmark), it’d go into a math research paper. Same for programmers figuring out some usefully novel architecture/algorithmic improvement. Graduate students don’t have a bird’s-eye view of the entirety of human knowledge, so they have to actually do the work, but the LLM can just modify the near-perfect-fit answer from an obscure publication/math.stackexchange thread or something.
I expect the same is the case with programming benchmarks, science-quiz benchmarks, et cetera.
Now, this doesn’t necessarily mean that AI progress has been largely illusory and that we’re way further from AGI than the AI hype men would have you believe (although I am very tempted to make this very pleasant claim, and I do place plenty of probability mass on it).
But if you’re scoring AIs by the problems they succeed at, rather than the problems they fail at, you’re likely massively overestimating their actual problem-solving capabilities.
One way to make this work is to just not consider your driven-to-madness future self an authority on the matter of what’s good or not. You can expect to start wishing for death, and still take actions that would lead you to this state, because present!you thinks that existing in a state of wishing for death is better than not existing at all.
I think that’s perfectly coherent.
No, I mean, I think some people actually hold that any existence is better than non-existence, so death is -inf for them and existence, even in any kind of hellscape, is above-zero utility.
I think “enforce NAP then give everyone a giant pile of resources to do whatever they want with” is a reasonable first-approximation idea regarding what to do with ASI, and it sounds consistent with Altman’s words.
But I don’t believe that he’s actually going to do that, so I think it’s just (3).
They are not necessarily “seeing” -inf in the way you or I are. They’re just kinda not thinking about it, or think that 0 (death) is the lowest utility can realistically go.
What looks like an S-risk to you or me may not count as -inf for some people.
Good point regarding GPT-”4.5″. I guess I shouldn’t have assumed that everyone else has also read Nesov’s analyses and immediately (accurately) classified them as correct.
that is a surprising amount of information
Is it? What of this is new?
To my eyes, the only remotely new thing is the admission that “there’s a lot of research still to get to [a coding agent]”.
You’re not accounting for enemy action. They couldn’t have been sure, at the outset, how successful the AI Notkilleveryoneism faction would be at raising the alarm, and, in general, how blatant the risks would become to outsiders as capabilities progressed. And they have been intimately familiar with the relevant discussions, after all.
So they might’ve overcorrected, and considered that the “strategic middle ground” would be to admit the risk is plausible (but not as certain as the “doomers” say), rather than to deny it (which they might’ve expected to become a delusional-looking position in the future, so not a PR-friendly stance to take).
Or, at least, I think this could’ve been a relevant factor there.
Any chance you can post (or PM me) the three problems AIs have already beaten?
This really depends on the definition of “smarter”. There is a valid sense in which Stockfish is “smarter” than any human. Likewise, there are many valid senses in which GPT-4 is “smarter” than some humans, and some valid senses in which GPT-4 is “smarter” than all humans (e. g., at token prediction). There will be senses in which GPT-5 is “smarter” than a bigger fraction of humans than GPT-4 is, perhaps including being smarter than Sam Altman under a bigger set of possible definitions of “smarter”.
Will that actually mean anything? Who knows.
By playing with definitions like this, Sam Altman can simultaneously inspire hype by implication (“GPT-5 will be a superintelligent AGI!!!”) and then, if GPT-5 underdelivers, avoid significant reputational losses by assigning a different meaning to his past words (“here’s a very specific sense in which GPT-5 is smarter than me, that’s what I meant, hype is out of control again, smh”). This is a classic tactic; basically a motte-and-bailey variant.
If anyone is game for creating an agentic research scaffold like that Thane describes
Here’s the basic structure in more detail, as envisioned after 5 minutes’ thought (a rough code sketch follows after the lists below):
You feed a research prompt to the “Outer Loop” of a model, maybe have a back-and-forth fleshing out the details.
The Outer Loop decomposes the research into several promising research directions/parallel subproblems.
Each research direction/subproblem is handed off to a “Subagent” instance of the model.
Each Subagent runs search queries on the web and analyses the results, up to the limits of its context window. After the analysis, it’s prompted to evaluate (1) which of the results/sources are most relevant and which should be thrown out, (2) whether this research direction is promising and what follow-up questions are worth asking.
If a Subagent is very eager to pursue a follow-up question, it can either run a subsequent search query (if there’s enough space in the context window), or it’s prompted to distill its current findings and replace itself with a next-iteration Subagent, in whose context it loads only the most important results + its analyses + the follow-up question.
This is allowed up to some iteration count.
Once all Subagents have completed their research, instantiate an Evaluator instance, into whose context window we dump the final results of each Subagent’s efforts (distilling if necessary). The Evaluator integrates the information from all parallel research directions and determines whether the research prompt has been satisfactorily addressed, and if not, what follow-up questions are worth pursuing.
The Evaluator’s final conclusions are dumped into the Outer Loop’s context (without the source documents, to not overload the context window).
If the Evaluator did not choose to terminate, the next generation of Subagents is spawned, each prompted with whatever contexts are recommended by the Evaluator.
Iterate, spawning further Subagent and Evaluator instances as needed.
Once the Evaluator chooses to terminate, or some compute upper-bound is reached, the Outer Loop instantiates a Summarizer, into which it dumps all of the Evaluator’s analyses + all of the most important search results.
The Summarizer is prompted to generate a high-level report outline, then write out each subsection, then the subsections are patched together into a final report.
Here’s what this complicated setup is supposed to achieve:
Do a pass over an actually large, diverse amount of sources. Most such web-search tools (Google’s, DeepSeek’s, or this shallow replication) are basically only allowed to make one search query, and then they have to contend with whatever got dumped into their context window. If the first search query turns out poorly targeted, in a way that becomes obvious after looking through the results, the AI’s out of luck.
Avoid falling into “rabbit holes”, i. e. some arcane-and-useless subproblems the model becomes obsessed with. Subagents would be allowed to fall into them, but presumably Evaluator steps, with a bird’s-eye view of the picture, would recognize that as a failure mode and not recommend that the Outer Loop follow it.
Attempt to patch together a “bird’s eye view” on the entirety of the results given the limitations of the context window. Subagents and Evaluators would use their limited scope to figure out what results are most important, then provide summaries + recommendations based on what’s visible from their limited vantage points. The Outer Loop and then the Summarizer instances, prompted with information-dense distillations of what’s visible from each lower-level vantage point, should effectively be looking at (a decent approximation of) the full scope of the problem.
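For concreteness, here’s a minimal Python sketch of the structure above. It’s not meant as any kind of definitive implementation: the OpenAI chat API is just a stand-in backend, the model name is a placeholder, and web_search is a stub for whatever search tool you’d actually wire in (Firecrawl or whatever):

```python
# Rough sketch of the Outer Loop / Subagent / Evaluator / Summarizer structure above.
# Assumptions (mine, not part of the proposal): the OpenAI chat API as the backend,
# a placeholder model name, and a web_search stub for whatever search tool you wire in.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def web_search(query: str) -> str:
    """Stub: return concatenated snippets for `query` from your search tool of choice."""
    raise NotImplementedError

def run_subagent(direction: str, max_iterations: int = 3) -> str:
    """One research direction: search, analyse, optionally respawn with distilled context."""
    context = direction
    for _ in range(max_iterations):
        query = llm(f"Research direction:\n{context}\n\nWrite one web search query for it.")
        results = web_search(query)
        analysis = llm(
            f"Direction:\n{context}\n\nSearch results:\n{results}\n\n"
            "Which results are most relevant? Is this direction promising? If a follow-up "
            "question is clearly worth pursuing, end with 'FOLLOW-UP: <question>'; else end with 'DONE'."
        )
        if "FOLLOW-UP:" not in analysis:
            return analysis
        # Distill and replace this Subagent with a next-iteration Subagent.
        context = llm(f"Distill the key findings and the follow-up question from:\n{analysis}")
    return context

def deep_research(research_prompt: str, max_rounds: int = 3, n_sections: int = 4) -> str:
    outer_context = research_prompt  # the Outer Loop's running context
    evaluations: list[str] = []
    for _ in range(max_rounds):
        directions = llm(
            f"{outer_context}\n\nDecompose this into 3 parallel research subproblems, one per line."
        ).splitlines()
        findings = [run_subagent(d) for d in directions if d.strip()]
        evaluation = llm(  # Evaluator: integrates findings, decides whether to terminate
            "Integrate these research findings:\n" + "\n---\n".join(findings) +
            f"\n\nOriginal research prompt: {research_prompt}\n"
            "Has it been satisfactorily addressed? If yes, end with 'TERMINATE'; "
            "if not, list the follow-up directions worth pursuing."
        )
        evaluations.append(evaluation)
        if "TERMINATE" in evaluation:
            break
        outer_context = evaluation  # conclusions only, no raw sources, to spare the context window
    # Summarizer: outline first, then sections, patched together at the end.
    notes = "\n\n".join(evaluations)
    outline = llm(f"Write a high-level report outline for these findings:\n{notes}")
    sections = [
        llm(f"Outline:\n{outline}\n\nFindings:\n{notes}\n\nWrite out section {i + 1} in full.")
        for i in range(n_sections)
    ]
    return "\n\n".join([outline] + sections)
```

The point is just the control flow: parallel Subagents with bounded iteration counts, an Evaluator that only ever sees distilled findings, and a Summarizer that writes the report section-by-section.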
Has anything like this been implemented in the open-source community or one of the countless LLM-wrapper startups? I would expect so, since it seems like an obvious thing to try + the old AutoGPT scaffolds worked in a somewhat similar manner… But it’s possible the market’s totally failed to deliver.
It should be relatively easy to set up using Flowise plus e. g. Firecrawl. If nothing like this has been implemented, I’m putting a $250 bounty on it (edited down from $500).

Edit: Alright, this seems like something very close to it.
I really wonder how much of the perceived performance improvement comes from agent-y training, as opposed to not sabotaging the format of the model’s answer + letting it do multi-turn search.
Compared to most other LLMs, Deep Research is able to generate reports up to 10k words, which is very helpful for providing comprehensive-feeling summaries. And unlike Google’s analogue, it’s not been fine-tuned to follow some awful deliberately shallow report template[1].
In other words: I wonder how much of Deep Research’s capability can be replicated by putting Claude 3.5.1 in some basic agency scaffold where it’s able to do multi-turn web search, and is then structurally forced to generate a long-form response (say, by always prompting with something like “now generate a high-level report outline”, then with “now write section 1”, “now write section 2”, and then just patching those responses together).
Have there been some open-source projects that already did so? If not, anyone willing to try? This would provide a “baseline” for estimating how well OpenAI’s agency training actually works, i. e. how much of the improvement comes from it.
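For what it’s worth, here’s the kind of bare-bones scaffold I have in mind, as a sketch rather than a recipe: the Anthropic SDK calls are real, but the model string is just my guess at the API name for “Sonnet 3.5.1”, and web_search is a stub for whatever search backend gets plugged in:

```python
# Sketch of the "baseline": a plain multi-turn scaffold around Claude, structurally
# forced into a long-form report. Model string and web_search are assumptions/stubs.
from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-3-5-sonnet-20241022"  # assumed API name for "Sonnet 3.5.1"

def web_search(query: str) -> str:
    raise NotImplementedError  # wire up whatever search tool you like here

def turn(history: list[dict], user_msg: str) -> str:
    """Append a user message, get the model's reply, keep the transcript alternating."""
    history.append({"role": "user", "content": user_msg})
    reply = client.messages.create(model=MODEL, max_tokens=4096, messages=history)
    text = reply.content[0].text
    history.append({"role": "assistant", "content": text})
    return text

def baseline_report(research_prompt: str, n_searches: int = 5, n_sections: int = 6) -> str:
    history: list[dict] = []
    # Multi-turn search phase: the model picks each query, sees the results, picks the next.
    turn(history, f"Research task: {research_prompt}\n\nGive me one web search query to run.")
    for _ in range(n_searches - 1):
        results = web_search(history[-1]["content"])
        turn(history, f"Search results:\n{results}\n\nGive me the next search query.")
    final_results = web_search(history[-1]["content"])
    # Structurally forced long-form output: outline, then section by section.
    outline = turn(
        history,
        f"Search results:\n{final_results}\n\nNow generate a high-level report outline.",
    )
    # "Now write section 1", "now write section 2", ... then patch the responses together.
    sections = [turn(history, f"Now write section {i + 1} in full.") for i in range(n_sections)]
    return "\n\n".join([outline] + sections)
```

Whatever gap remains between something this dumb and Deep Research would then be the part actually attributable to OpenAI’s agency training.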
Subjectively, its use as a research tool is limited—it found only 18 sources in a five-minute search
Another test to run: if you give those sources to Sonnet 3.5.1 (or a Gemini model, if Sonnet’s context window is too small) and ask it to provide a summary, how far from Deep Research’s does it feel in quality/insight?
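A minimal sketch of that comparison, assuming the 18 sources are saved as local text files (the directory layout and the model string are my own placeholders):

```python
# Dump the same sources Deep Research found into Sonnet 3.5.1 in one shot, then compare
# the resulting summary against Deep Research's report by eye. Paths/model name assumed.
from pathlib import Path
from anthropic import Anthropic

client = Anthropic()
sources = "\n\n---\n\n".join(p.read_text() for p in sorted(Path("sources").glob("*.txt")))

summary = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed API name for "Sonnet 3.5.1"
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": "Here are the sources a research agent found:\n\n" + sources
                   + "\n\nWrite a comprehensive research summary based only on these sources.",
    }],
).content[0].text

print(summary)  # eyeball this against the Deep Research report for quality/insight
```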
[1] Which is the vibe I got from Google’s Deep Research when I’d been experimenting. I think its main issue isn’t any lack of capability, but that the fine-tuning dataset for how the final reports are supposed to look had been bad: it’s been taught to have an atrocious aesthetic taste.
Fair enough, I suppose calling it an outright wrapper was an oversimplification. It still basically sounds like just the sum of the current offerings.