OpenAI reports that o3-mini, with high reasoning and a Python tool, scores 32% on FrontierMath. However, Epoch’s official evaluation[1] found a score of only 11%.
There are a few reasons to trust Epoch’s score over OpenAI’s:
Epoch built the benchmark and has better incentives.
OpenAI reported a 28% score on the hardest of the three problem tiers—suspiciously close to their overall score.
Epoch has published a good deal of information about its testing infrastructure and data, whereas OpenAI has published almost none.
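To make the second point concrete, here is a rough sketch of the arithmetic. It assumes the 32% and 28% figures come from the same problem set and setup, and that the overall score is a problem-weighted average across tiers. The function name and the tier shares are hypothetical; the post does not give the real tier split.

```python
def implied_other_tiers(overall: float, hardest: float, hardest_share: float) -> float:
    """Average accuracy on the non-hardest tiers implied by a weighted average.

    overall = hardest_share * hardest + (1 - hardest_share) * other
    """
    if not 0.0 <= hardest_share < 1.0:
        raise ValueError("hardest_share must be in [0, 1)")
    return (overall - hardest_share * hardest) / (1.0 - hardest_share)


# Figures reported in the post: 32% overall, 28% on the hardest tier.
OVERALL, HARDEST = 0.32, 0.28

# Hypothetical shares of the benchmark made up by the hardest tier.
for share in (0.10, 0.25, 0.40):
    rest = implied_other_tiers(OVERALL, HARDEST, share)
    print(f"hardest tier = {share:.0%} of problems -> other tiers average {rest:.1%}")
```

Under any of these assumed splits, the easier tiers would have to average only about 32% to 35%, barely above the 28% reported for the hardest tier. That flat profile is what makes the reported numbers look odd.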
Addendum (added in an edit): Epoch has this to say in its FAQ:
“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time compute, or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private).”
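For a sense of scale, the two percentages can be turned into approximate problem counts. This is only a sketch: it assumes OpenAI's 32% was measured on the 180-problem set, which Epoch offers as one possibility rather than a confirmed fact, and the reported percentages are themselves rounded. The helper name is hypothetical.

```python
def implied_solved(score: float, n_problems: int) -> int:
    """Approximate number of problems solved, given a fractional score."""
    return round(score * n_problems)


# 32% on the 180-problem frontiermath-2024-11-26 set (assumed for OpenAI's run).
openai_solved = implied_solved(0.32, 180)

# 11% on the 290-problem frontiermath-2025-02-28-private set (Epoch's run).
epoch_solved = implied_solved(0.11, 290)

print(f"OpenAI (assumed 180 problems): about {openai_solved} solved")
print(f"Epoch (290 problems): about {epoch_solved} solved")
```

On these assumptions the gap is roughly 58 problems solved out of 180 versus roughly 32 out of 290, so a different subset alone would have to account for a large shift in which problems were solvable.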
FrontierMath Score of o3-mini Much Lower Than Claimed
[1] Epoch’s evaluation also had Python access.