I haven’t had warm receptions when critiquing points, which has frustratingly left me with bad detection for when I’m being nasty, so if I sound thorny it’s not my intent.
Somewhere I think you might have misstepped is the frontier math questions: the quotes you’ve heard are almost certainly about tier 3 questions, the hardest ones meant for math researchers in training. The mid tier is for grad student level problems and tier 1 is bright high schooler to undergrad level problems
Tier 1: 25% of the test
Tier 2: 50% of the test
Tier 3: 25%
O3 got 25%, probably answering none of the hard questions and suspiciously matching almost exactly the proportion of easy questions. From some, there seems to be disagreement about whether tier 2 questions are consistently harder than tier 1 questions
Regardless, some (especially easier) problems are of the sort that can be verified and have explicitly been said to have instantaneously recognizable solutions. This is not an incredibly flattering picture of o3
THIS IS THE END OF WHERE I THINK YOU WERE MISTAKEN, TEXT PAST THIS IS MORE CONJECTURE AND/OR NOT DIRECTLY REFUTING YOU
the ARC test looks like it was taken by an overfit model. If the test creators are right, then the arc test for an 85 percent off a tuned model and probably spamming conclusions that it could verify, it trained on 75 percent of the questions from what I understand so one of that score seems like memorization and a mildly okay score on the 25 percent that was held as test data.
And this part is damning: the arc-2 test which is the success of to the first one, made by the same people, gets a 95 percent pass rate form humans (so easier than the 85 percent pass rate of the first test) but o3′s score dropped to a 30%, a 55% drop and now 65% below human on a similar test made by the same people.
Let me be clear: if that isn’t VERY inaccurate, then this is irrefutably a cooked test and o3 is overfit to the point of invalidating the results for any kind of generalizability.
There are other problems, like the fact that this pruning search method is really, really bad for some problems and that it seems to ride on validation being somewhat easy in order to work at all but that’s not material to the benchmarks
I can cite sources if these are important points, not obviously incorrect, etc, I might write my first post about it if I’m digging that much!
I haven’t had warm receptions when critiquing points, which has frustratingly left me with bad detection for when I’m being nasty, so if I sound thorny it’s not my intent.
Somewhere I think you might have misstepped is the frontier math questions: the quotes you’ve heard are almost certainly about tier 3 questions, the hardest ones meant for math researchers in training. The mid tier is for grad student level problems and tier 1 is bright high schooler to undergrad level problems
Tier 1: 25% of the test
Tier 2: 50% of the test
Tier 3: 25%
O3 got 25%, probably answering none of the hard questions and suspiciously matching almost exactly the proportion of easy questions. From some, there seems to be disagreement about whether tier 2 questions are consistently harder than tier 1 questions
Regardless, some (especially easier) problems are of the sort that can be verified and have explicitly been said to have instantaneously recognizable solutions. This is not an incredibly flattering picture of o3
THIS IS THE END OF WHERE I THINK YOU WERE MISTAKEN, TEXT PAST THIS IS MORE CONJECTURE AND/OR NOT DIRECTLY REFUTING YOU
the ARC test looks like it was taken by an overfit model. If the test creators are right, then the arc test for an 85 percent off a tuned model and probably spamming conclusions that it could verify, it trained on 75 percent of the questions from what I understand so one of that score seems like memorization and a mildly okay score on the 25 percent that was held as test data.
And this part is damning: the arc-2 test which is the success of to the first one, made by the same people, gets a 95 percent pass rate form humans (so easier than the 85 percent pass rate of the first test) but o3′s score dropped to a 30%, a 55% drop and now 65% below human on a similar test made by the same people.
Let me be clear: if that isn’t VERY inaccurate, then this is irrefutably a cooked test and o3 is overfit to the point of invalidating the results for any kind of generalizability.
There are other problems, like the fact that this pruning search method is really, really bad for some problems and that it seems to ride on validation being somewhat easy in order to work at all but that’s not material to the benchmarks
I can cite sources if these are important points, not obviously incorrect, etc, I might write my first post about it if I’m digging that much!