FrontierMath Score of o3-mini Much Lower Than Claimed

OpenAI reports that o3-mini with high reasoning and a Python tool scores 32% on FrontierMath. However, Epoch's official evaluation[1] found only 11%.

There are a few reasons to trust Epoch's score over OpenAI's:

  • Epoch built the benchmark, and unlike OpenAI it has no incentive to inflate the score.

  • OpenAI reported a 28% score on the hardest of the three problem tiers, suspiciously close to their 32% overall score. If the difficulty tiers mean anything, the hardest tier should land well below the average (see the sketch after this list).

  • Epoch has published quite a bit of information about its testing infrastructure and data, whereas OpenAI has published close to none.
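To make the second point concrete, here is a minimal sketch of the weighted-average arithmetic. The tier proportions below are assumptions chosen purely for illustration; they are not FrontierMath's published composition:

```python
# Sketch: what does a 32% overall score imply about the easier tiers
# if the hardest tier alone scores 28%?
# The tier mix below is an ASSUMPTION for illustration, not
# FrontierMath's actual composition.

tier_weight = {"tier1": 0.25, "tier2": 0.50, "tier3": 0.25}  # assumed mix
overall = 0.32   # OpenAI's reported overall accuracy
tier3 = 0.28     # OpenAI's reported accuracy on the hardest tier

# overall = w1*s1 + w2*s2 + w3*s3, so the average over tiers 1-2 is:
easier_weight = tier_weight["tier1"] + tier_weight["tier2"]
easier_avg = (overall - tier_weight["tier3"] * tier3) / easier_weight

print(f"Implied average on tiers 1 and 2: {easier_avg:.1%}")  # ~33.3%
```

Under any reasonable weighting, a hardest-tier score this close to the overall average means the model solves research-level problems nearly as often as the easier ones, which is hard to square with a genuinely graded difficulty scale.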

Addendum (edited in):
Epoch has this to say in their FAQ:

The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time compute, or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private).
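As a rough sanity check on the subset explanation, here is a back-of-the-envelope sketch. It rests on an assumption neither Epoch nor OpenAI has confirmed, namely that the older 180-problem set is contained in the newer 290-problem set; if the two versions are disjoint, the arithmetic does not apply:

```python
# Back-of-the-envelope check on the "different subset" explanation.
# ASSUMPTION (unconfirmed): the 180 problems in frontiermath-2024-11-26
# are all contained in the 290-problem frontiermath-2025-02-28-private set.

old_n, old_score = 180, 0.32   # OpenAI's reported run
new_n, new_score = 290, 0.11   # Epoch's run

solved_old = old_score * old_n   # ~58 problems solved in the old run
# If the model re-solved exactly those problems and none of the newer
# ones, its score on the larger set would still be:
floor_on_new = solved_old / new_n
print(f"{floor_on_new:.1%} vs Epoch's measured {new_score:.0%}")  # ~19.9% vs 11%
```

Even under that generous assumption, perfect carry-over of the old solves would leave a floor near 20%, so the subset difference alone would not close the gap down to 11%; the scaffold and test-time-compute differences Epoch mentions would have to do much of the work.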

  1. Which had Python access.