It was already known that the AGI labs were experimenting with synthetic data and that OpenAI is training GPT-5, and the article is light on new details:
It’s not really true that modern AIs “can’t reliably solve math problems they haven’t seen before”: this depends on the operationalization of “a math problem” and “seen before”. All this statement says is “Strawberry is better at math than the SOTA models”, which in turn means “nonzero AI progress”.
Similar for hallucinations.
The one concrete example is solving the New York Times’ Connections puzzle, but Claude 3.5 can already do that on a good day.
I mean, the state of affairs is by no means unworrying, but I don’t really see what in this article would prompt a meaningful update?
I also felt like this was mostly priced in, but here is a perhaps more useful prompt for people who feel like they made an update: this is a good time to ask “How could I have thought that faster?”, and to think about which updates you maybe still haven’t fully propagated.
Agreed, always a good exercise to do when surprised.
We knew they were experimenting with synthetic data. We didn’t know they were succeeding.
The big answer, now that we know o1 was made using Q*/Strawberry, is essentially that Strawberry/Q* did two very important things:
It cracked the code on how to make a General Purpose Search that scales with more compute; in particular, the model can now adaptively think for longer on harder problems.
In essence, OpenAI figured out how to implement General Purpose Search scalably:
https://www.lesswrong.com/posts/6mysMAqvo9giHC4iX/what-s-general-purpose-search-and-why-might-we-expect-to-see
It unlocked a new inference scaling law, which means that more inference compute reliably solves more problems.
This makes AI capabilities harder to contain, since large inference runs are much easier to come by than large training runs (a toy sketch of both points follows below).
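To make both points concrete, here is a minimal toy sketch in Python of the general shape of the idea: sample candidate solutions, check them with a verifier, and spend a larger sampling budget on harder problems. Everything in it (`solve_attempt`, `verify`, the difficulty-based budget) is a hypothetical placeholder for illustration; OpenAI has not published how o1 actually works.

```python
import random

# Toy illustration only: a minimal sketch of verifier-guided best-of-n
# sampling, NOT OpenAI's actual (unpublished) method. `solve_attempt`,
# `verify`, and the difficulty-based budget are hypothetical placeholders.

def solve_attempt(problem: str) -> str:
    """Sample one candidate solution (placeholder: a coin flip)."""
    return random.choice(["right", "wrong"])

def verify(problem: str, answer: str) -> bool:
    """Check a candidate (placeholder: exact match against 'right')."""
    return answer == "right"

def adaptive_search(problem: str, difficulty: int) -> str | None:
    """Adaptively 'think longer' on harder problems: the sampling
    budget (inference compute) grows with estimated difficulty."""
    budget = 2 ** difficulty
    for _ in range(budget):
        candidate = solve_attempt(problem)
        if verify(problem, candidate):
            return candidate
    return None  # budget exhausted without a verified answer

# The inference scaling law in miniature: if one sample succeeds with
# probability p, then n independent samples succeed with probability
# 1 - (1 - p)**n, which climbs toward 1 as n (inference compute) grows.
p = 0.01
for n in [1, 10, 100, 1000]:
    print(f"n={n:5d}  P(success) ≈ {1 - (1 - p) ** n:.3f}")
```

The last loop is the containment point in miniature: whoever can buy enough inference compute can push success rates toward 1 on problems the model only occasionally solves, with no training run required.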