Great post, thanks for writing it; I agree with the broad point.
> I think I am more or less the perfect target audience for FrontierMath results, and as I said above, I would have no idea how to update on the AIs’ math abilities if it came out tomorrow that they are getting 60% on FrontierMath.
This describes my position well, too: I was surprised by how well the o3 models performed on FM, and also surprised by how hard it is to map this into how good they are at math in common-sense terms.
I also have some additional information from contributing problems to FM, but the problems seem to me to vary greatly in guessability. E.g. Daniel Litt writes that he didn’t fully internalize the requirement of guess-proofness, whereas for me it was a critical design constraint I actively tracked when crafting problems. The problems also vary greatly in the depth vs. breadth of skills they require (another aspect Litt highlights). This heterogeneity makes it hard to get a sense of what 30% or 60% or 85% performance means.
I find your example in footnote 3 striking: I do think this problem is easy and also very standard. (Funnily enough, I have written training material that illustrates this particular method[1], and I’ve certainly seen it written up elsewhere as well.) Which again illustrates just how hard it is to make advance predictions about which problems the models will or won’t be able to solve—even “routine application of a standard-ish math competition method” doesn’t imply that o3-mini will solve it.
I also find it exhausting how hard it is to get an answer to the literal question of “how well does model X perform on FrontierMath?” As you write, OpenAI reports 32%, whereas Epoch AI reports 11%. A twenty-one percentage point difference, a 3x ratio in success rate!? I understand that capability elicitation is hard, but this is Not Great.[2]
That OpenAI is likely (at least indirectly) hill-climbing on FM doesn’t help matters either[3], and the exclusivity of the deal presumably rules out possibilities like “publish problems once all frontier models are able to solve them so people can see what sort of problems they can reliably solve”.
> I was already skeptical of the theory of change of “Mathematicians look at the example problems, get a feel of how hard they are, then tell the world how impressive an X% score is”. But I further updated downward on this when I noticed that the very first public FrontierMath example problem (Artin primitive root conjecture) is just non-sense as stated,[8][9] and apparently no one reported this to the authors before I did a few days ago.
(I’m the author of the mentioned problem.)
There indeed was a nonsensical formula in the problem statement, which I’m grateful David pointed out (and which is now fixed on Epoch AI’s website). I think flagging the problem itself as just nonsense is too strong, though. I’ve heard that models have tried approaches that give approximately correct answers, so it seems that they basically understood what I intended to write from the context.
That said, this doesn’t undermine the point David was making about information (not) propagating via mathematicians.
- ^
In Finnish: Tehtävä 22.3 (“Problem 22.3”) here.
- ^
Added on March 15th: This difference probably largely comes from OpenAI reporting scores for the best internal version they have while Epoch AI reports scores for the publicly available model, and one just can’t get 32%-level performance with the public version; see Elliot’s comment below.
- ^
There’s been talk of Epoch AI having a subset they keep private from OpenAI, but evaluation results for that set don’t seem to be public. (I initially got the opposite impression, but the confusingly-named FrontierMath-2025-02-28-Private isn’t it.)
As school ends for the summer vacation in Finland, people typically sing a particular song (“suvivirsi” ~ “summer psalm”). The song is religious, which makes many people oppose the practice, but it’s also a nostalgic tradition, which makes many people support the practice. And so, as one might expect, it’s discussed every once in a while in e.g. mainstream newspapers with no end in sight.
As another opinion piece came out recently, a friend talked to me about it. He said something along the lines of: “The people who write opinion pieces against the summer psalm are adults. Children see it differently.” And the subtext I read into that was: “You don’t see children being against the summer psalm; it’s always the adults. Weird, huh?”
I thought this was obviously invalid: surely one shouldn’t expect the opinion pieces to be written by children!
(I didn’t say this out loud, though. I was pretty frustrated by what I thought was bizarre argumentation, but couldn’t articulate my position in a snappy one-liner in the heat of the moment. So I instead resorted to the snappier—but still true—argument “when I was a kid I found singing the summer psalm uncomfortable”.)
This is a situation where it would have been nice to have the concepts “kodo” and “din” be common knowledge. If the two candidate worlds are “adults dislike the summer psalm, but children don’t mind it” and “both adults and children dislike the summer psalm”, then you’d expect the opinion pieces to be written by adults in either case. The observation doesn’t discriminate between the worlds: it’s not kodo, it’s din.
I don’t think this example is captured by the words “signal” and “noise” or the concept of signal-to-noise ratio. Even if I squint at it, describing my friend as focusing on noise seems confusing and counterproductive.