It seems very strange to me to say that they cheated, when the public training set is intended to be used exactly for training. They did what the test specified! And they didn’t even use all of it.
The whole point of the test is that some training examples aren’t going to unlock the rest of it. What training definitely does is teach the model how to output the JSON in the right format, and likely how to think about what to even do with these visual puzzles.
Do we say that humans aren’t a general intelligence even though for ~all valuable tasks, you have to take some time to practice, or someone has to show you, before you can do it well?
When I first wrote the post I did make the mistake of writing that they were cheating :( sorry about that.
A few hours later I noticed the mistake and removed the statements, put the word “cheating” in quotes, and explained it at the end.
To be fair, this isn’t really cheating in the sense that they are allowed to use the data, which is why it’s called a “public training set.” But the version of the test that allows this is not the AGI test.
It’s possible you saw the old version due to browser caches.
Again, I’m sorry.
I think my main point still stands.
I disagree that “The whole point of the test is that some training examples aren’t going to unlock the rest of it. What training definitely does is teach the model how to output the JSON in the right format, and likely how to think about what to even do with these visual puzzles.”
I don’t think poor performance on benchmarks by SOTA generative AI models is due to failing to understand output formatting, nor that models should need example questions in their training data (or reinforcement learning sets) to compensate for this. Instead, a good benchmark should explain the formatting clearly to the model, maybe with examples in the input context.
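To make concrete what “examples in the input context” might look like, here is a minimal, hypothetical sketch (in Python) of an ARC-style prompt where the JSON grid format is demonstrated through solved example pairs in the prompt itself, rather than learned through fine-tuning. The grids and helper names below are made up for illustration, not taken from any lab’s actual harness:

```python
import json

# Hypothetical ARC-style task: two solved demonstration pairs plus one
# test input, with grids encoded as nested lists of small integers.
demo_pairs = [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 2], [0, 2]], "output": [[0, 0], [2, 0]]},
]
test_input = [[3, 0], [0, 3]]

def build_prompt(demos, test_grid):
    """Show the model the exact JSON grid format via in-context examples,
    instead of relying on fine-tuning to teach the output format."""
    lines = [
        "Each puzzle maps an input grid to an output grid.",
        "Grids are JSON arrays of arrays of integers 0-9.",
        "",
    ]
    for i, pair in enumerate(demos, 1):
        lines.append(f"Example {i} input: {json.dumps(pair['input'])}")
        lines.append(f"Example {i} output: {json.dumps(pair['output'])}")
    lines.append(f"Test input: {json.dumps(test_grid)}")
    lines.append("Reply with only the output grid as JSON.")
    return "\n".join(lines)

print(build_prompt(demo_pairs, test_input))
```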
I agree that tuning the model using the public training set does not automatically unlock the rest of it! But I strongly disagree that this is the whole point of the test. If it were, then the Kaggle SOTA would clearly be better than OpenAI’s o1 according to the test. This is seen vividly in François Chollet’s graph.
No one claims this means the Kaggle models are smarter than o1, nor that the test completely fails to test intelligence since the Kaggle models rank higher than o1.
Why does no one seem to be arguing for either? Probably because of the unspoken understanding that there are two versions of the test: one where the model fits the public training set and tries to predict the private test set, and two where you have a generally intelligent model which happens to be able to do this test. When people compare different models using the test, they are implicitly using the second version.
Most generative AI models did the harder second version, but o3 (and the Kaggle models) did the first version, which (annoyingly to me) is the official version. It’s still not right to compare other models’ scores with o3’s score.
Even if o3 (and the Kaggle models) “did what the test specified,” they didn’t do what most people who compare LLMs using the ARC benchmark are looking for, and this has the potential to mislead those people.
The Kaggle models don’t mislead these people because they are very clearly not generally intelligent, but o3 does mislead people (maybe accidentally, maybe deliberately).
From what I know, o3 probably did reinforcement learning (see my comment).
We disagree on this but may agree on other things. I agree o3 is extremely impressive due to its 25.2% FrontierMath score, where the contrast against other models is more genuine (though there is a world of difference between the easiest 25% of questions and the hardest 25% of questions).