Just trying to follow along… here’s where I’m at with a bear case that we haven’t seen evidence that o3 is an immediate harbinger of real transformative AGI:
Codeforces scoring is based in part on wall-clock time. And we all already knew that, if AI can do something at all, it can probably do it much faster than humans. So it’s a valid comparison to previous models, but not straightforwardly a comparison to top human coders.
FrontierMath is 25% tier 1 (least hard), 50% tier 2, 25% tier 3 (most hard). Terence Tao’s quote about the problems being hard referred only to tier 3. Tier 1 is maybe IMO/Putnam level. Also, even some of the tier 2 problems allegedly rely on straightforward application of specialized knowledge, rather than cleverness, such that a mathematician could “immediately” know how to do it (see this tweet). Even many IMO/Putnam problems are minor variations on a problem that someone somewhere has written down and is thus in the training data. Note also that 25.2% is roughly the tier-1 share of the benchmark, so the headline number is at least arithmetically consistent with o3 mostly solving the easiest tier (see the back-of-envelope sketch below). So o3’s 25.2% result doesn’t really prove much in terms of a comparison to human mathematicians, although again it’s clearly an advance over previous models.
ARC-AGI — we already knew that many of the ARC-AGI questions are solvable by enumerating lots and lots of hypotheses and checking them (“crude program enumeration”; see the enumeration sketch below), and the number of tokens that o3 used to solve the problems (55k per solution?) suggests that o3 is still doing that to a significant extent. Now, in terms of comparing to humans, I grant that there’s some fungibility between insight (coming up with a promising hypothesis) and brute force (enumerating lots of hypotheses and checking them). Deep Blue beat Kasparov at chess by checking 200 million moves per second. You can call it cheating, but Deep Blue still won the game. If future AGI similarly beats humans at novel science and technology and getting around the open-ended real world etc. via less insight and more brute force, then we humans can congratulate ourselves for our still-unmatched insight, but who cares, AGI is still beating us at novel science and technology etc. On the other hand, brute force worked for chess but didn’t work for Go, because the combinatorial explosion of possibilities blows up faster in Go than in chess. Plausibly the combinatorial explosion of possible ideas and concepts and understanding in the open-ended real world blows up faster still. ARC-AGI is a pretty constrained universe; intuitively, it seems more on the chess end of the chess-Go spectrum, such that brute-force hypothesis enumeration evidently works reasonably well on ARC-AGI. But (on this theory) that approach wouldn’t generalize to capability at novel science and technology and getting around in the open-ended real world etc.
(I really haven’t been paying close attention and I’m open to correction.)
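To make the tier-mix point concrete, here’s a back-of-envelope in Python. The tier shares are the published 25/50/25 split, but the per-tier solve rates are made-up numbers (as far as I know, the breakdown of which problems o3 actually solved hasn’t been published); the only point is that a ~25% headline score is arithmetically consistent with cracking mostly the easiest tier.

```python
# Hypothetical back-of-envelope: the tier shares are the published 25/50/25 split,
# but the per-tier solve rates below are invented purely for illustration.
tier_share = {"tier1": 0.25, "tier2": 0.50, "tier3": 0.25}
solve_rate = {"tier1": 0.90, "tier2": 0.05, "tier3": 0.00}  # made-up numbers

overall = sum(tier_share[t] * solve_rate[t] for t in tier_share)
print(f"{overall:.1%}")  # 25.0% -- in the ballpark of the reported 25.2%
```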
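And here’s a minimal, purely illustrative Python sketch of what “crude program enumeration” means on an ARC-style task: enumerate compositions of simple grid operations, keep whichever composition reproduces every training pair, and apply it to the test input. This is not o3’s actual method (nobody outside OpenAI knows that), and the primitive set here is made up; it’s just the brute-force baseline the ARC-AGI discussion has in mind.

```python
import numpy as np
from itertools import product

# A tiny, made-up hypothesis space: compositions of a few simple grid operations.
PRIMITIVES = {
    "identity": lambda g: g,
    "flip_h": lambda g: np.fliplr(g),
    "flip_v": lambda g: np.flipud(g),
    "rot90": lambda g: np.rot90(g),
    "transpose": lambda g: g.T,
}

def enumerate_hypotheses(max_depth=3):
    """Yield (name, function) for every composition of primitives up to max_depth."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def hypothesis(g, names=names):
                for n in names:
                    g = PRIMITIVES[n](g)
                return g
            yield " . ".join(names), hypothesis

def solve(train_pairs, test_input):
    """Brute force: return the first hypothesis consistent with all training pairs."""
    for name, h in enumerate_hypotheses():
        if all(np.array_equal(h(np.array(x)), np.array(y)) for x, y in train_pairs):
            return name, h(np.array(test_input))
    return None, None

# Toy task where the hidden rule happens to be "rotate 90 degrees".
train = [([[1, 2], [3, 4]], np.rot90([[1, 2], [3, 4]]).tolist())]
print(solve(train, [[5, 6], [7, 8]]))
```

The point of the toy is that nothing in it requires insight, only a hypothesis space small enough to search exhaustively, which is exactly the chess-vs-Go question above.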
I think the best bull case is something like:
They did this pretty quickly and were able to greatly improve performance on a moderately diverse range of pretty checkable tasks. This implies OpenAI likely has an RL pipeline which can be scaled up to substantially better performance by putting in easily checkable tasks + compute + algorithmic improvements (see the toy verifier-reward sketch below). And, given that this is RL, there isn’t any clear reason this won’t work (with some additional annoyances) for scaling through very superhuman performance (edit: in these checkable domains).[1]
Credit to @Tao Lin for talking to me about this take.
I express something similar on twitter here.
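As a hedged illustration of the “RL pipeline over easily checkable tasks” idea: the following is a toy, bandit-style Python loop under my own assumptions, not OpenAI’s pipeline (which isn’t public). The shape is just: sample a candidate answer, run a machine verifier on it, and reinforce whatever passes; more compute means more verified wins to learn from.

```python
import random

def verifier(task, answer):
    """Machine-checkable ground truth: a stand-in for unit tests / exact-answer checks."""
    return answer == task["target"]

def sample_answer(task, weights):
    """Stand-in for the model proposing a candidate solution."""
    return random.choices(task["candidates"], weights=weights[task["id"]])[0]

def reinforce(weights, task, answer, reward, lr=0.5):
    """Crude policy update: upweight answers the verifier accepted."""
    i = task["candidates"].index(answer)
    weights[task["id"]][i] += lr * reward

# Toy "checkable domain": questions whose answers an exact verifier can score.
tasks = [{"id": 0, "candidates": [3, 4, 5], "target": 4},
         {"id": 1, "candidates": [9, 10, 11], "target": 10}]
weights = {t["id"]: [1.0] * len(t["candidates"]) for t in tasks}

for _ in range(200):  # more compute -> more verified wins to learn from
    task = random.choice(tasks)
    answer = sample_answer(task, weights)
    reward = 1.0 if verifier(task, answer) else 0.0
    reinforce(weights, task, answer, reward)

print(weights)  # sampling weight concentrates on the verifier-approved answers
```

Nothing in this loop caps out at human level; it only needs a verifier, which is why the caveat about unverifiable domains (below) matters.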
given that this is RL, there isn’t any clear reason this won’t work (with some additional annoyances) for scaling through very superhuman performance
Not where they don’t have a way of generating verifiable problems. Where they merely have a fixed pool of human-written problems, improvement is likely bounded by the size of that pool.
Yeah, sorry, this is an important caveat. But I think very superhuman performance in most/all checkable domains is pretty spooky, and that’s even putting aside how it generalizes.
I concur with all of this.
Two other points:
It’s unclear to what extent the capability advances brought about by moving from LLMs to o1/3-style stuff generalize beyond math and programming (i. e., domains in which it’s easy to set up RL training loops based on machine-verifiable ground-truth).
Empirical evidence: “vibes-based evals” of o1 hold that it’s much better than standard LLMs in those domains, but is at best as good as Sonnet 3.5.1 outside them. Theoretical justification: if there are easy-to-specify machine verifiers, then the “correct” solution for the SGD to find is to basically just copy these verifiers into the model’s forward passes. And if we can’t use our program/theorem-verifiers to verify the validity of our real-life plans, it’d stand to reason the corresponding SGD-found heuristics won’t generalize to real-life stuff either.
Math/programming capabilities were coupled to general performance in the “just scale up the pretraining” paradigm: bigger models were generally smarter. It’s unclear whether the same coupling holds for the “just scale up the inference-compute” paradigm; I’ve seen no evidence of that so far.
The claim that “progress from o1 to o3 was only three months” is likely false/misleading. Talk of Q*/Strawberry has been around since the board drama of November 2023, at which point it had already supposedly beaten some novel math benchmarks. So o1, or a meaningfully capable prototype of it, has been around for more than a year now. They only chose to announce and release it three months ago. (See e. g. gwern’s related analysis here.)
o3, by contrast, seems to be their actual current state-of-the-art model, which they’ve only recently trained. They haven’t been sitting on it for months, haven’t spent months making it ready/efficient enough for a public release.
Hence the illusion of insanely fast progress. (Which was probably exactly OpenAI’s aim.)
I’m open to being corrected on any of these claims if anyone has relevant data, of course.
Can’t we just count from announcement to announcement? Like sure, they were working on this stuff before they got o1 working, but they are always going to be working on the next thing.
Do you think that o1 wasn’t the best model (of this type) that OpenAI had internally at the point of the o1 announcement? If so, do you think that o3 isn’t the best model (of this type) that OpenAI has internally now?
If your answers differ (including quantitatively), why?
The main exception is that o3 might be based on a different base model, which could imply that a bunch of the gains are from earlier scaling.
I don’t think counting from announcement to announcement is valid here, no. They waited to announce o1 until they had o1-mini and o1-preview ready to ship: i. e., until they’d already gotten around to optimizing these models for compute-efficiency and setting up the server infrastructure for running them. That couldn’t have taken zero time. Separately, there’s evidence they’ve had them in-house for a long time, between the Q* rumors from a year ago and the Orion/Strawberry rumors from a few months ago.
This is not the case for o3. At the very least, it is severely unoptimized, costing thousands of dollars per task (i. e., it’s not even ready for the hypothetical $2000/month subscription they floated).
That is,
Do you think that o1 wasn’t the best model (of this type) that OpenAI had internally at the point of the o1 announcement? If so, do you think that o3 isn’t the best model (of this type) that OpenAI has internally now?
Yes and yes.
The case for “o3 is the best they currently have in-house” is weaker, admittedly. But even if that’s not the case, and they already have “o4” internally, the fact that o1 (or powerful prototypes of it) existed well before the September announcement seems strongly confirmed, and that already dismantles the narrative of “o1 to o3 took three months”.
Good points! I think we underestimate the role that brute force plays in our brains though.
I don’t think you can explain away SWE-bench performance with any of these explanations.
I’m not questioning whether o3 is a big advance over previous models—it obviously is! I was trying to address some suggestions / vibe in the air (example) that o3 is strong evidence that the singularity is nigh, not just that there is rapid ongoing AI progress. In that context, I haven’t seen people bringing up SWE-bench as much as those other three that I mentioned, although it’s possible I missed it. Mostly I see people bringing up SWE-bench in the context of software jobs.
I was figuring that the SWE-bench tasks don’t seem particularly hard, intuitively. E.g., 90% of SWE-bench Verified problems are “estimated to take less than an hour for an experienced software engineer to complete”. And a lot more people have the chops to become an “experienced software engineer” than to become able to solve FrontierMath problems or place in the top 200 in the world on Codeforces. So the latter sound extra impressive, and that’s what I was responding to.
I mean, fair, but when did a benchmark designed to test REAL software engineering issues that take less than an hour suddenly stop seeming “particularly hard” for a computer?
Feels like we’re being frogboiled.