The difference between our results and OpenAI's might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time compute, or running their evaluation on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private).
That definitely sounds like OpenAI training on (or perhaps constructing a scaffold around) the part of the benchmark Epoch shared with them.