O3 scores higher on FrontierMath than the top graduate students
I’d guess that’s basically false. In particular, I’d guess that:
o3 probably does outperform mediocre grad students, but not actual top grad students. This guess is based on generalization from GPQA: I personally tried 5 GPQA problems in different fields at a workshop and got 4 of them correct, whereas the benchmark designers claim the rates at which PhD students get them right are much lower than that. I think the resolution is that the benchmark designers tested on very mediocre grad students, and probably the same is true of the FrontierMath benchmark.
the amount of time humans spend on the problem is a big factor: human performance has compounding returns on the scale of hours invested, whereas o3’s performance basically doesn’t have compounding returns in that way. (There was a graph floating around which showed this pretty clearly, but I don’t have it on hand at the moment.) So plausibly o3 outperforms humans who are not given much time, but not humans who spend a full day or two on each problem.
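To make the time-scaling point concrete, here is a toy model (purely illustrative; every number in it is made up, not measured): assume a human’s solve probability compounds with hours invested, while the model’s chance is fixed by its inference budget and doesn’t improve with wall-clock time.

```python
import math

def human_solve_prob(hours: float, difficulty: float = 20.0, gamma: float = 1.5) -> float:
    # Toy assumption: human progress compounds with time invested, so the
    # solve probability follows 1 - exp(-(hours**gamma) / difficulty).
    return 1 - math.exp(-(hours ** gamma) / difficulty)

def model_solve_prob(hours: float, p_single: float = 0.25) -> float:
    # Toy assumption: the model's chance is set by its fixed inference budget
    # and does not improve as human-scale wall-clock time passes.
    return p_single

for hours in [0.5, 1, 2, 4, 8, 16]:
    print(f"{hours:>4} h   human={human_solve_prob(hours):.2f}   model={model_solve_prob(hours):.2f}")
```

On these made-up parameters the model beats a human given an hour or two but loses badly to one given a full day, which is the shape of the claim above.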
I bet o3 does actually score higher on FrontierMath than the math grad students best at math research, but not higher than the math grad students best at competition math (e.g. hard IMO problems) and at quickly solving problems in arbitrary domains. I think around 25% of FrontierMath consists of hard-IMO-like problems, and this is probably mostly what o3 is solving. See here for context.
Quantitatively, maybe o3 is roughly in the top 1% of US math grad students on FrontierMath? (Perhaps roughly top 200?)
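As a rough arithmetic check on those two estimates (the inputs below are my own assumptions, not figures from the benchmark): if about a quarter of FrontierMath is competition-style and o3’s headline score is in the same ballpark, then o3 solving mostly that slice is arithmetically consistent, and “top 1% ≈ top 200” only follows if the relevant population of US math grad students is on the order of 20,000.

```python
# Back-of-the-envelope check; every input here is a rough assumption.
comp_fraction = 0.25        # assumed share of FrontierMath that is competition-style
o3_score = 0.25             # o3's reported headline score, roughly
research_solve_rate = 0.05  # assumed solve rate on the research-style remainder

# If the overall score comes from these two slices, the implied rate on the
# competition-style slice is:
implied_comp_rate = (o3_score - (1 - comp_fraction) * research_solve_rate) / comp_fraction
print(f"implied solve rate on competition-style problems: {implied_comp_rate:.0%}")

# "Top 1% = roughly top 200" implicitly assumes ~20,000 US math grad students.
population = 20_000
print(f"top 1% of {population:,} grad students = {int(0.01 * population)}")
```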
I think one of the other problems with benchmarks is that they necessarily select for formulaic/uninteresting problems that we fundamentally know how to solve. If a mathematician figured out something genuinely novel and important, it wouldn’t go into a benchmark (even if it were initially intended for a benchmark); it’d go into a math research paper. Same for programmers figuring out some usefully novel architecture/algorithmic improvement. Graduate students don’t have a bird’s-eye view of the entirety of human knowledge, so they have to actually do the work, but the LLM can just modify the near-perfect-fit answer from an obscure publication or math.stackexchange thread.
Which perhaps suggests a better way to do math evals is to scope out a set of novel math publications made after a given knowledge-cutoff date, and see whether the new model can replicate those? (Though this also needs to be done carefully, since tons of publications are also trivial and formulaic.)
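As a sketch of what such an eval could look like (all helper names below are hypothetical placeholders, not an existing harness): collect papers published after the model’s cutoff, hold out the proofs, ask the model to re-derive each result, and filter out the trivial/formulaic papers before scoring.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional

@dataclass
class Paper:
    title: str
    published: date
    statement: str  # the main result, stated without its proof
    proof: str      # the published proof, held out as the reference answer

# Hypothetical placeholders -- paper collection, model querying, and grading
# would all have to be implemented (grading novel proofs is the hard part).
def collect_papers(after: date) -> list[Paper]: ...
def ask_model(prompt: str) -> str: ...
def graded_as_correct(candidate: str, reference: str) -> bool: ...

def post_cutoff_eval(cutoff: date,
                     novelty_filter: Optional[Callable[[Paper], bool]] = None) -> float:
    papers = collect_papers(after=cutoff)
    if novelty_filter is not None:
        # e.g. drop anything a weaker baseline model already solves, to screen
        # out the trivial/formulaic publications mentioned above
        papers = [p for p in papers if novelty_filter(p)]
    solved = sum(
        graded_as_correct(ask_model(f"Prove or derive the following result:\n{p.statement}"), p.proof)
        for p in papers
    )
    return solved / len(papers)
```

The hard parts this glosses over are grading (checking a novel proof is itself research-level work) and contamination (making sure the statement, not just the proof, postdates the cutoff).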
Maybe you want:
Though it’s worth noting here that the AI is using best-of-k, and individual trajectories saturate without some top-level aggregation scheme.
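For intuition on why best-of-k saturates: treating the k trajectories as independent draws (a simplification), the chance that at least one solves a problem with per-attempt success probability p is 1 - (1 - p)^k. That climbs fast for problems the model can sometimes solve, but problems with p near zero stay unsolved no matter how large k gets, so the aggregate curve flattens once the solvable problems are exhausted.

```python
def best_of_k(p_single: float, k: int) -> float:
    # Probability that at least one of k independent attempts succeeds.
    return 1 - (1 - p_single) ** k

for k in [1, 4, 16, 64, 256, 1024]:
    print(f"k={k:>4}   p=0.10 -> {best_of_k(0.10, k):.3f}   p=0.0001 -> {best_of_k(0.0001, k):.3f}")
```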
It might be more illuminating to look at labor cost vs. performance, which looks like: