At a talk at UTokyo, Sam Altman said (clipped here and here):
“We’re doing this new project called Stargate which has about 100 times the computing power of our current computer”
“We used to be in a paradigm where we only did pretraining, and each GPT number was exactly 100x, or not exactly but very close to 100x and at each of those there was a major new emergent thing. Internally we’ve gone all the way to about a maybe like a 4.5”
“We can get performance on a lot of benchmarks [using reasoning models] that in the old world we would have predicted wouldn’t have come until GPT-6, something like that, from models that are much smaller by doing this reinforcement learning.”
“The trick is when we do it this new way [using RL for reasoning], it doesn’t get better at everything. We can get it better in certain dimensions. But we can now more intelligently than before say that if we were able to pretrain a much bigger model and do [RL for reasoning], where would it be. And the thing that I would expect based off of what we’re seeing with a jump like that is the first bits or sort of signs of life on genuine new scientific knowledge.”
“Our very first reasoning model was a top 1 millionth competitive programmer in the world [...] We then had a model that got to top 10,000 [...] O3, which we talked about publicly in December, is the 175th best competitive programmer in the world. I think our internal benchmark is now around 50 and maybe we’ll hit number one by the end of this year.”
“There’s a lot of research still to get to [a coding agent]”
Wow, that is a surprising amount of information. I wonder how reliable we should expect this to be.
Is it? What of this is new?
To my eyes, the only remotely new thing is the admission that “there’s a lot of research still to get to [a coding agent]”.
The estimate that their largest model ever is at only ≤50x the compute of GPT-4 (and “largest version ever” is a very helpful way to phrase it) is quite relevant to many discussions (props to Nesov), and something Altman probably shouldn’t’ve said.
The estimate that test-time compute yields roughly a 1000x effective-compute multiplier confirms earlier, looser talk (the arithmetic is sketched below).
The scientific research part is of uncertain importance but we may well be referring back to this statement a year from now.
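A back-of-the-envelope sketch of where that 1000x figure comes from, assuming the “each GPT number is very close to 100x” convention from the quotes above; the version numbers are my illustrative reading of Altman’s statements, not official figures:

```python
# Altman: reasoning models reach benchmarks that "in the old world we would
# have predicted wouldn't have come until GPT-6", from models at or below
# ~GPT-4.5 scale. With each GPT number worth ~100x compute, that gap is
# ~1.5 GPT numbers (an assumed reading, not an official figure):
effective_compute_gain = 100 ** (6 - 4.5)  # 100**1.5 = 1000
print(f"implied effective-compute multiplier: ~{effective_compute_gain:.0f}x")  # ~1000x
```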
Good point regarding GPT-“4.5”. I guess I shouldn’t have assumed that everyone else has also read Nesov’s analyses and immediately (accurately) classified them as correct.
It’s just surprising that Sam is willing to say/confirm all of this given that AI companies normally at least try to be secretive.
He says things that are advantageous, and sometimes they are even true. The benefit of not being known to be a liar usually keeps the correlation between claims and truth positive, but in his case it seems that ship has sailed.
(Checkably false claims are still pretty rare, and this may be one of those.)
That seems to imply that:
If current levels are around GPT-4.5, the compute increase from GPT-4 would be either ~10× or ~50×, depending on whether the half-step is read on a log scale (100^0.5 ≈ 10×) or a linear scale (halfway to GPT-5’s 100× ≈ 50×); see the sketch at the end of this comment.
The completion of Stargate would then push OpenAI’s compute to around GPT-5.5 levels, since 100× the current compute is one more GPT number. However, since other compute expansions (e.g., Azure scaling) are also ongoing, they may reach this level sooner.
Recent discussions have suggested that better base models, rather than major changes to the RL setup itself, are the key enabler of the current RL approaches. If so, once the base model shifts from a GPT-4o-scale model to a GPT-5.5-scale model, there could be a strong jump in capabilities.
It’s unclear how much of a difference it makes to train the new base model (GPT-5) on reasoning traces from O3/O4 before applying RL. But by the time the GPT-5-scale run begins, there will likely be a large corpus of filtered, high-quality reasoning traces, further edited for clarity, to incorporate into pretraining.
The change to a better base model for RL might enable longer-horizon agentic work as the next “emergent thing”; combined with superhuman coding skills, this might already be quite unsafe.
GPT-5’s reasoning abilities may be significantly more domain-specific than those of prior models.
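A minimal sketch of the arithmetic behind these implications, again assuming Altman’s “each GPT number is very close to 100x” convention; the scale factors and version numbers are illustrative readings of his statements, not confirmed figures:

```python
PER_GPT_NUMBER = 100  # "each GPT number was ... very close to 100x"

# "About a 4.5" internally: two readings of the half-step above GPT-4.
log_reading = PER_GPT_NUMBER ** 0.5    # 100**0.5 ~= 10x GPT-4's compute
linear_reading = PER_GPT_NUMBER * 0.5  # halfway to GPT-5's 100x ~= 50x

# Stargate: "about 100 times the computing power of our current computer".
# Another 100x on top of ~GPT-4.5 is one more GPT number, i.e. ~GPT-5.5.
stargate_gpt_level = 4.5 + 1

print(f"log reading:    ~{log_reading:.0f}x GPT-4")    # ~10x
print(f"linear reading: ~{linear_reading:.0f}x GPT-4") # ~50x
print(f"post-Stargate:  ~GPT-{stargate_gpt_level}")    # ~GPT-5.5
```

Both readings land at or below the ≤50x figure discussed above.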