Some more evidence that whatever AI progress on benchmarks is measuring, it's likely not measuring what you think it's measuring:

AIME 2025 Part I was conducted yesterday, and the scores of some language models are available here: https://matharena.ai, thanks to @mbalunovic, @ni_jovanovic et al.
I have to say I was impressed: I had predicted the smaller distilled models would crash and burn, but they actually scored a reasonable 25-50%.
That was surprising to me! These are new problems, not seen during training, right? I expected the smaller models to barely score above 0%. It's really hard to believe that a 1.5B model can solve pre-olympiad math problems when it can't multiply 3-digit numbers. I was wrong, I guess.
I then used OpenAI's Deep Research to see if problems similar to those in AIME 2025 exist on the internet. And guess what? A problem identical to Q1 of AIME 2025 exists on Quora:
I haven't checked beyond that, because the freaking p-value is too low already. Problems near-identical to the test set can be found online.
So, what, if anything, does this imply for math benchmarks? And what does it imply for all the sudden hill climbing due to RL?
I'm not certain, and there is a reasonable argument that even if the training set contains near-identical (but not exact) copies of test data, solving them still counts as generalization. I am sympathetic to that. But I also wouldn't rule out that GRPO is amazing at sharpening memories along with math skills.
At the very least, the above shows that data decontamination is hard.
Never ever underestimate the amount of stuff you can find online. Practically everything exists online.
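To make the decontamination point concrete, here is a minimal sketch of the kind of exact n-gram overlap filter commonly used to scrub benchmark items from training corpora. Both problem statements below are hypothetical stand-ins of my own (not the actual AIME 2025 Q1 or the Quora post); the point is that a light rewording is enough to defeat the match, which is roughly why near-identical problems slip through:

```python
# Minimal sketch of an exact n-gram decontamination filter, and how a light
# paraphrase slips past it. Both problem statements are hypothetical stand-ins.

def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of a lowercased, whitespace-split string."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(a: str, b: str, n: int = 8) -> float:
    """Fraction of a's n-grams that also occur in b."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a) if grams_a else 0.0

benchmark_item = (
    "Find the number of ordered pairs of positive integers (a, b) "
    "such that a + b = 100 and gcd(a, b) = 5."
)
web_item = (  # same problem, lightly reworded
    "How many ordered pairs of positive integers (a, b) with a + b = 100 "
    "have greatest common divisor equal to 5?"
)

print(ngram_overlap(benchmark_item, web_item))  # 0.0 -- the filter passes it
```

Catching paraphrases takes fuzzier machinery (embedding similarity, or a manual search like the Deep Research pass above), and even that only finds what you think to look for.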
I expected that:

I think one of the other problems with benchmarks is that they necessarily select for formulaic/uninteresting problems that we fundamentally know how to solve. If a mathematician figured out something genuinely novel and important, it wouldn't go into a benchmark (even if it were initially intended for one), it'd go into a math research paper. Same for programmers figuring out some usefully novel architectural/algorithmic improvement. Graduate students don't have a bird's-eye view of the entirety of human knowledge, so they have to actually do the work, but the LLM just modifies the near-perfect-fit answer from an obscure publication/math.stackexchange thread or something.
I expect the same is the case with programming benchmarks, science-quiz benchmarks, et cetera.
Now, this doesn't necessarily mean that AI progress has been largely illusory and that we're way further from AGI than the AI hype men would have you believe (although I am very tempted to make this very pleasant claim, and I do place plenty of probability mass on it).
But if you’re scoring AIs by the problems they succeed at, rather than the problems they fail at, you’re likely massively overestimating their actual problem-solving capabilities.
To be fair, non-reasoning models do much worse on these questions, even though the same training data was likely already available to GPT-4.
Now, I could believe that RL is more or less working as an elicitation method, which plausibly explains the results, though it's still interesting why reasoning models get much better scores even with very similar training data.
I do think RL works "as intended" to some extent, teaching models some actual reasoning skills, much like SSL works "as intended" to some extent, chiseling in some generalizable knowledge. The question is to what extent it's one or the other.
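For concreteness, here is a minimal sketch of the group-relative advantage at the heart of GRPO (my own illustration based on the published description, not any lab's implementation, and omitting the clipped policy ratio and KL penalty). The reward is outcome-only, so a completion that reaches the right final answer by recalling a near-identical solved problem is reinforced exactly as much as one that reasons its way there; nothing in the signal distinguishes elicitation of memories from learning of skills:

```python
import numpy as np

def grpo_advantages(rewards: list) -> np.ndarray:
    """Group-relative advantages: each completion's reward is normalized
    against the mean and std of the group sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Hypothetical group of 6 completions for one problem, scored 1 if the final
# answer matches and 0 otherwise. The two correct ones get the same positive
# advantage whether the solution was derived or recalled.
rewards = [0, 0, 1, 0, 1, 0]
print(grpo_advantages(rewards))  # correct ~ +1.41, incorrect ~ -0.71
```

Whether that normalized push mostly sharpens retrieval or mostly teaches reasoning is exactly the open question.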