I think there is a third explanation here. The Kaggle model (probably) does well because you can brute-force it with a bag of heuristics and gradually iterate, discarding the ones that don’t work and keeping the ones that do.
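Something like this minimal sketch of a greedy selection loop (all names and the toy scoring are mine, not any actual Kaggle entry):

```python
def greedy_heuristic_search(candidates, tasks, score):
    """Bag-of-heuristics iteration: keep a heuristic only if adding it
    improves the score on the visible task set; otherwise discard it."""
    kept, best = [], score([], tasks)
    for h in candidates:
        trial = score(kept + [h], tasks)
        if trial > best:              # it helps on these tasks: keep it
            kept, best = kept + [h], trial
        # else: discard and move on -- this is the trial-and-error part
    return kept, best

# Toy usage: tasks are target ints, heuristics are guess functions,
# and a bag scores as its best single member.
tasks = [1, 2, 3]
candidates = [lambda t: t, lambda t: 0, lambda t: t if t < 3 else 0]
score = lambda hs, ts: max((sum(h(t) == t for t in ts) for h in hs), default=0)
kept, best = greedy_heuristic_search(candidates, tasks, score)
print(best, "of", len(tasks), "toy tasks solved")
```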
Do you not consider that ultimately isomorphic to what o3 does?
No, I believe there is a human in the loop for the above, if that’s not clear.
You’ve said it in another comment. But this is probably an “architecture search”.
I guess the training loop for o3 is similar, but it would be on the easier training set instead of the far harder test set.
Wow, it does say the test set problems are harder than the training set problems. I didn’t expect that.
But it’s not an enormous difference: the example model that got 53% on the public training set got 38% on the public test set. It got only 24% on the private test set, even though that’s supposed to be equally hard, maybe because “trial and error” fitted the model to the public test set as well as to the public training set.
The other example model got 32%, 30%, and 22%.
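To make the “fitted to the public test set” reading concrete, here are the drops those scores imply (model labels are mine):

```python
models = {"model A": (53, 38, 24), "model B": (32, 30, 22)}  # train/public/private %
for name, (train, public, private) in models.items():
    print(f"{name}: train->public drop {train - public} pts, "
          f"public->private drop {public - private} pts")
# model A: train->public drop 15 pts, public->private drop 14 pts
# model B: train->public drop 2 pts, public->private drop 8 pts
```

A drop between two sets that are supposed to be equally hard is the signature of fitting to the public one.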
I think the Kaggle models might have the human design the heuristics while o3 discovers heuristics on its own during RL (unless it was trained on human reasoning on the ARC training set?).

o3’s “AI-designed heuristics” might let it learn far more heuristics than humans can think of and verify, while the Kaggle models’ “human-designed heuristics” might require less AI technology and compute. I don’t actually know how the Kaggle models work, I’m guessing.

I finally looked at the Kaggle models and I guess it is similar to RL for o3.
I agree. I think the Kaggle models have more advantages than o3; they have far more human design and fine-tuning. One can almost argue that some Kaggle models are very slightly trained on the test set, in the sense that the humans making them learn from test-set results and empirically discover what improves those results.
o3’s defeating the Kaggle models is very impressive, but o3’s results shouldn’t be directly compared against those of other, untuned models.
One can almost argue that some Kaggle models are very slightly trained on the test set

I’d say they’re more-than-trained on the test set. My understanding is that humans were essentially able to do an architecture search, picking the best architecture for handling the test set, and then also put whatever detailed heuristics they wanted into it based on studying the test set (including by doing automated heuristic search using SGD; it’s all fair game). So they’re not “very slightly” trained, they’re trained^2.
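A minimal sketch of what “trained^2” means here, assuming (as above) that both the architecture choice and the heuristic choice are scored on the same public test set; every name is hypothetical:

```python
def trained_squared(architectures, heuristics, test_tasks, score):
    """Two nested selection loops, both scored on the SAME test set:
    the outer one picks an architecture, the inner one picks heuristics,
    so test-set information leaks into every level of the design."""
    best, best_score = None, float("-inf")
    for arch in architectures:                # human architecture search
        kept = []
        for h in heuristics:                  # heuristic search (manual or SGD-driven)
            if score(arch, kept + [h], test_tasks) > score(arch, kept, test_tasks):
                kept.append(h)
        s = score(arch, kept, test_tasks)     # final pick: also on the test set
        if s > best_score:
            best, best_score = (arch, kept), s
    return best, best_score
```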
Arguably the same is the case for o3, of course. ML researchers are using benchmarks as targets, and while they may not be directly trying to Goodhart them, there’s still a search process over architectures-plus-training-loops whose termination condition is “the model beats a new benchmark”. And SGD itself is, in some ways, a much better programmer than any human.
So o3’s development and training process essentially contained the development-and-training process for Kaggle models. They’ve iteratively searched for an architecture that can be trained to beat several benchmarks, then did so.
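A toy version of that outer loop, just to make the termination condition explicit (all names are mine):

```python
def search_until_benchmark_beaten(recipes, train_and_eval, threshold):
    """Iterate over (architecture, training-loop) recipes and stop as
    soon as one trains into a model that beats the benchmark. No single
    run needs to see the benchmark data for the *process* to be
    selected on the benchmark: it's the stopping rule."""
    for recipe in recipes:
        benchmark_score = train_and_eval(recipe)   # inner loop: SGD does the work
        if benchmark_score >= threshold:
            return recipe, benchmark_score         # "the model beats the benchmark"
    return None, None
```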
They did this on the far easier training set, though?
An alternative story is that they trained until a model was found that could beat not just the training set but many other benchmarks too, implying that there may be some general intelligence factor there. Maybe this is still Goodharting on benchmarks, but there’s probably genuinely something there.
There are degrees of Goodharting. It’s not Goodharting to ARC-AGI specifically, but it is optimizing for performance on the array of easily-checkable benchmarks, which plausibly have some common factor to which you could “Goodhart”; i.e., a way to get good at them without actually training generality.