OpenAI didn’t fine-tune on ARC-AGI, even though this graph suggests they did.
Sources:
Altman said

we didn’t go do specific work [targeting ARC-AGI]; this is just the general effort.
François Chollet (in the blogpost with the graph) said

Note on “tuned”: OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
and

The version of the model we tested was domain-adapted to ARC-AGI via the public training set (which is what the public training set is for). As far as I can tell they didn’t generate synthetic ARC data to improve their score.
An OpenAI staff member replied
and further confirmed that “tuned” in the graph is
Another OpenAI staff member said

also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the broader o3 train distribution, and we didn’t do any additional domain-specific fine-tuning on the final checkpoint
So on ARC-AGI they just pretrained on 300 examples (75% of the 400 in the public training set). Performance is surprisingly good.
[heavily edited after first posting]
OpenAI shared they trained the o3 we tested on 75% of the Public Training set

Probably a dataset for RL; that is, the model was trained to try and try again to solve these tests with long chains of reasoning, not just tuned or pretrained on them. A detail like 75% of examples sounds like a test-centric dataset design decision, with the other 25% going to the validation part of the dataset.
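(As a minimal sketch of why “75%” smells like a designed split rather than contamination: holding out a quarter of the 400 public tasks as a validation set is the mundane way to land on exactly that number. The file name and loading code below are my own assumption, not anything OpenAI has described.)

```python
# Hypothetical illustration: splitting the 400 public ARC-AGI training tasks
# 75/25 into train and validation sets. The file name is assumed, and nothing
# here reflects OpenAI's actual pipeline.
import json
import random

with open("arc-agi_training_challenges.json") as f:  # assumed local copy of the public set
    tasks = list(json.load(f).items())               # ~400 (task_id, task) pairs

random.seed(0)
random.shuffle(tasks)
cut = int(0.75 * len(tasks))                         # 300 tasks for training...
train_tasks, val_tasks = tasks[:cut], tasks[cut:]    # ...and 100 held out for validation
```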
Altman: “didn’t go do specific work … just the general effort”

Seems plausible they trained on ALL the tests, specifically targeting various tests. The public part of ARC-AGI is “just” a part of that dataset of all the tests. Could be some part of explaining the o1/o3 difference in the $20 tier.
Thank you so much for your research! I would have never found these statements.
I’m still quite suspicious. Why would they be “including a (subset of) the public training set”? Is it accidental data contamination? They don’t say so. Do they think simply including some questions and answers without reinforcement learning or reasoning would help the model solve other such questions? That’s possible but not very likely.
Were they “including a (subset of) the public training set” in o3’s base training data? Or in o3’s reinforcement learning problem/answer sets?
Altman never said “we didn’t go do specific work [targeting ARC-AGI]; this is just the general effort.”
Instead he said,

Worth mentioning that we didn’t, we target and we think it’s an awesome benchmark but we didn’t go do specific [inaudible]--you know, the general rule, but yeah really appreciate the partnership this was a fun one to do.
The gist I get is that he admits to targeting it but that OpenAI targets all kinds of problem/answer sets for reinforcement learning, not just ARC’s public training set. It felt like he didn’t want to talk about this too much, from the way he interrupted himself and changed the topic without clarifying what he meant.
The other sources do sort of imply no reinforcement learning. I’ll wait to see if they make a clearer denial of reinforcement learning, rather than a “non-denial denial” that can be reinterpreted as “we didn’t fine-tune o3,” in the sense that they didn’t use a separate derivative of o3 (one fine-tuned just for this test) to take the test.
My guess is o3 is tuned using the training set, since François Chollet (developer of ARC) somehow decided to label o3 as “tuned” and OpenAI isn’t racing to correct this.
I’m still quite suspicious. Why would they be “including a (subset of) the public training set”?

The answer is that the ARC Prize allowed them to do this, and the test is designed in such a way that unlocking the training set cannot allow you to do well on the test set.
[ARC-AGI] is designed in such a way that unlocking the training set cannot allow you to do well on the test set

I don’t know whether I would put it this strongly. I haven’t looked deeply into it, but isn’t it basically a non-verbal IQ test? Those very much do have a kind of “character” to them, such that studying how they work in general can let you derive plenty of heuristics for solving them. Those heuristics would be pretty abstract, yet far below the abstraction level of “general intelligence” (or the pile of very-abstract heuristics we associate with “general intelligence”).
I agree that tuning the model using the public training set does not automatically unlock the rest of it! But the Kaggle SOTA is clearly better than OpenAI’s o1 according to the test. This is seen vividly in François Chollet’s graph.
No one claims this means the Kaggle models are smarter than o1, nor that the test completely fails to test intelligence since the Kaggle models rank higher than o1.
Why does no one seem to be arguing for either? Probably because of the unspoken understanding that there are two versions of the test: one where the model is fitted to the public training set and tries to predict the private test set, and one where a generally intelligent model simply happens to be able to do this test. When people compare different models using the test, they are implicitly using the second version.
Most generative AI models did the harder second version, but o3 (and the Kaggle versions) did the first version, which—annoyingly to me—is the official version. It’s still not right to compare other models’ scores with o3’s score.
I think there is a third explanation here. The Kaggle model (probably) does well because you can brute force it with a bag of heuristics and gradually iterate by discarding ones that don’t work and keeping the ones that do.
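(A minimal sketch of that kind of loop, purely my own illustration and not how any particular Kaggle entry actually works: keep a pool of candidate heuristics, check each one against the tasks you can verify, and discard the ones that never earn their keep.)

```python
# Illustrative only: a "bag of heuristics" loop over ARC-style tasks.
# Each heuristic inspects a task's demonstration pairs and either returns a
# rule (a function from input grid to output grid) or None if it doesn't apply.

def solve(task, heuristics):
    """Return the first rule that reproduces all of the task's demonstrations."""
    for heuristic in heuristics:
        rule = heuristic(task["train"])
        if rule is not None and all(rule(pair["input"]) == pair["output"] for pair in task["train"]):
            return rule
    return None

def prune(candidate_heuristics, tasks, keep_threshold=1):
    """Keep only heuristics that solve at least `keep_threshold` of the known tasks."""
    kept = []
    for heuristic in candidate_heuristics:
        solved = sum(1 for task in tasks if solve(task, [heuristic]) is not None)
        if solved >= keep_threshold:
            kept.append(heuristic)
    return kept
```

The human-in-the-loop part is deciding which candidate heuristics to write down in the first place; the loop above only filters them.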
Do you not consider that ultimately isomorphic to what o3 does?
No, I believe there is a human in the loop for the above if that’s not clear.
You’ve said it in another comment. But this is probably an “architecture search”.
I guess the training loop for o3 is similar but it would be on the easier training set instead of the far harder test set.
Wow it does say the test set problems are harder than the training set problems. I didn’t expect that.
But it’s not an enormous difference: the example model that got 53% on the public training set got 38% on the public test set. It got only 24% on the private test set, even though it’s supposed to be equally hard, maybe because “trial and error” fitted the model to the public test set as well as the public training set.
The other example model got 32%, 30%, and 22%.
I think the Kaggle models might have humans design the heuristics, while o3 discovers heuristics on its own during RL (unless it was trained on human reasoning on the ARC training set?).

o3’s “AI designed heuristics” might let it learn far more heuristics than humans can think of and verify, while the Kaggle models’ “human designed heuristics” might require less AI technology and compute. I don’t actually know how the Kaggle models work, I’m guessing.

I finally looked at the Kaggle models and I guess it is similar to RL for o3.
I agree. I think the Kaggle models have more advantages than o3. I think they have far more human design and fine-tuning than o3. One can almost argue that some Kaggle models are very slightly trained on the test set, in the sense that the humans making them learn from test-set results and empirically discover what improves those results.
o3’s defeating the Kaggle models is very impressive, but o3’s results shouldn’t be directly compared against those of other, untuned models.
One can almost argue that some Kaggle models are very slightly trained on the test set

I’d say they’re more-than-trained on the test set. My understanding is that humans were essentially able to do an architecture search, picking the best architecture for handling the test set, and then also put whatever detailed heuristics they wanted into it based on studying the test set (including by doing automated heuristics search using SGD, it’s all fair game). So they’re not “very slightly” trained, they’re trained^2.
Arguably the same is the case for o3, of course. ML researchers are using benchmarks as targets, and while they may not be directly trying to Goodhart to them, there’s still a search process over architectures-plus-training-loops whose termination condition is “the model beats a new benchmark”. And SGD itself is, in some ways, a much better programmer than any human.
So o3’s development and training process essentially contained the development-and-training process for Kaggle models. They’ve iteratively searched for an architecture that can be trained to beat several benchmarks, then did so.
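(A toy rendering of that claim, my framing rather than a description of OpenAI’s actual process: the outer loop searches over architectures and training recipes, and it halts when the resulting model clears the benchmark suite. propose_recipe, train, and evaluate below are hypothetical stand-ins.)

```python
# Illustrative pseudo-experiment loop: the benchmark scores sit inside the
# termination condition, so the development process is selected on benchmarks
# even if no single training run is. All function arguments are hypothetical.

def develop_model(benchmarks, target_scores, propose_recipe, train, evaluate, max_iters=100):
    history = []                                   # (recipe, scores) pairs tried so far
    for _ in range(max_iters):
        recipe = propose_recipe(history)           # researchers tweak architecture / training loop
        model = train(recipe)                      # inner loop: SGD, RL, whatever the recipe says
        scores = {b: evaluate(model, b) for b in benchmarks}
        history.append((recipe, scores))
        if all(scores[b] >= target_scores[b] for b in benchmarks):
            return model                           # stop condition: "the model beats the benchmarks"
    return None                                    # or keep iterating, as labs do in practice
```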
They did this on the far easier training set though?
An alternative story is they trained until a model was found that could beat not just the training set but many other benchmarks too, implying that there may be some general intelligence factor there. Maybe this is still Goodharting on benchmarks, but there’s probably truly something there.
There are degrees of Goodharting. It’s not Goodharting to ARC-AGI specifically, but it is optimizing for performance on the array of easily-checkable benchmarks. Which plausibly have some common factor between them to which you could “Goodhart”; i. e., a way to get good at them without actually training generality.
See my other comment instead.
The key question is “how much of the performance is due to ARC-AGI data.”

If the untuned o3 was anywhere near as good as the tuned o3, why didn’t they test it and publish it? If the most important and interesting test result is somehow omitted, take things with a grain of salt.

I admit that running the test is extremely expensive, but there should be compromises, like running the cheaper version or only doing a few questions.

Edit: oh, that reply seems to deny reinforcement learning, or at least “fine tuning.” I don’t understand why François Chollet calls the model “tuned” then. Maybe wait for more information, I guess.

Edit again: I’m still not sure yet. They might be denying that it’s a separate version of o3 finetuned to do ARC questions, while not denying they did reinforcement learning on the ARC public training set. I guess a week or so later we might find out what “tuned” truly means.

Edit: wait, likely it’s RL; I’m confused.

[edited more]