I concur with all of this.
Two other points:
It’s unclear to what extent the capability advances brought about by moving from LLMs to o1/o3-style models generalize beyond math and programming (i.e., domains in which it’s easy to set up RL training loops based on machine-verifiable ground truth; see the sketch at the end of this comment for what I mean by such a loop).
Empirical evidence: “vibes-based evals” of o1 hold that it’s much better than standard LLMs in those domains, but at best as good as Sonnet 3.5.1 outside of them. Theoretical justification: if there are easy-to-specify machine verifiers, then the “correct” solution for SGD to find is to essentially copy these verifiers into the model’s forward passes. And if we can’t use our program/theorem verifiers to check the validity of our real-life plans, it stands to reason that the corresponding SGD-found heuristics won’t generalize to real-life stuff either.
Math/programming capabilities were coupled to general performance in the “just scale up the pretraining” paradigm: bigger models were generally smarter. It’s unclear whether the same coupling holds for the “just scale up the inference-compute” paradigm; I’ve seen no evidence of that so far.
The claim that “progress from o1 to o3 was only three months” is likely false/misleading. Talk of Q*/Strawberry has been around since the board drama of November 2023, at which point it had already supposedly beaten some novel math benchmarks. So o1, or a meaningfully capable prototype of it, has existed for more than a year now; OpenAI only chose to announce and release it three months ago. (See e.g. gwern’s related analysis here.)
o3, by contrast, seems to be their actual current state-of-the-art model, trained only recently. They haven’t been sitting on it for months, nor have they spent months making it ready/efficient enough for a public release.
Hence the illusion of insanely fast progress. (Which was probably exactly OpenAI’s aim.)
I’m open to being corrected on any of these claims if anyone has relevant data, of course.
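To make the first point concrete, here’s a minimal sketch of the kind of “RL training loop based on machine-verifiable ground truth” I have in mind. Everything here (the `sample_solution`/`update_policy` methods, the task format, the choice of unit tests as the verifier) is made up for illustration, not a claim about anyone’s actual training setup:

```python
# Minimal sketch of an RL training loop whose reward comes from a machine
# verifier (unit tests) rather than from human judgment. The `model` and
# `task` objects are hypothetical stand-ins, purely for illustration.
import subprocess
import tempfile

def verify(candidate_code: str, tests: str) -> float:
    """Reward is 1.0 iff the candidate program passes the given unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

def training_step(model, task):
    # Sample several reasoning traces / candidate solutions for the same problem...
    candidates = [model.sample_solution(task.prompt) for _ in range(8)]
    # ...score each one with the mechanical verifier...
    rewards = [verify(c, task.tests) for c in candidates]
    # ...and reinforce whatever reasoning produced the passing solutions.
    model.update_policy(task.prompt, candidates, rewards)
```

The point is that the reward signal here is cheap, automatic, and unambiguous; there’s no analogous verifier for “was this real-life plan actually any good?”, which is the crux of the generalization worry.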
Can’t we just count from announcement to announcement? Sure, they were working on this stuff prior to getting o1 to work, but they’re always going to be working on the next thing.
Do you think that o1 wasn’t the best model (of this type) that OpenAI had internally at the point of the o1 announcement? If so, do you think that o3 isn’t the best model (of this type) that OpenAI has internally now?
If your answers differ (including quantitatively), why?
The main exception is that o3 might be based on a different base model, which could imply that a bunch of the gains come from earlier scaling.
I don’t think counting from announcement to announcement is valid here, no. They waited to announce o1 until they had o1-mini and o1-preview ready to ship: i.e., until they had already gotten around to optimizing these models for compute-efficiency and setting up the server infrastructure to run them. That couldn’t have taken zero time. Separately, there’s evidence they’d had these models in-house for a long time, between the Q* rumors from a year ago and the Orion/Strawberry rumors from a few months ago.
This is not the case for o3. At the very least, it is severely unoptimized, costing thousands of dollars per task (i.e., it’s not even ready for the hypothetical $2000/month subscription they floated).
That is,
Do you think that o1 wasn’t the best model (of this type) that OpenAI had internally at the point of the o1 announcement? If so, do you think that o3 isn’t the best model (of this type) that OpenAI has internally now?
Yes and yes.
The case for “o3 is the best they currently have in-house” is weaker, admittedly. But even if that’s not the case and they already have “o4” internally, the fact that o1 (or powerful prototypes of it) existed well before the September announcement seems strongly confirmed, and that already dismantles the narrative of “o1 to o3 took three months”.