I want to bring an interesting new benchmark to your attention, since o1-preview currently scores very badly on it. The benchmark may or may not be solved anytime soon, but at present no model does better than 5-10%:
1/10 Today we’re launching FrontierMath, a benchmark for evaluating advanced mathematical reasoning in AI. We collaborated with 60+ leading mathematicians to create hundreds of original, exceptionally challenging math problems, of which current AI systems solve less than 2%.
https://x.com/EpochAIResearch/status/1854993676524831046
Thanks!
It seems unsurprising to me that there are benchmarks o1-preview is bad at. I don’t mean to suggest that it can do general reasoning in a highly consistent and correct way on arbitrarily hard problems[1]; I expect that it still has the same sorts of reliability issues as other LLMs (though probably less often), and some of the same difficulty building and using internal models without inconsistency, and that there are also individual reasoning steps that are beyond its ability. My only claim here is that o1-preview knocks down the best evidence that I knew of that LLMs can’t do general reasoning at all on novel problems.
I think that to many people that claim may just look obvious; of course LLMs are doing some degree of general reasoning. But the evidence against was strong enough that there was a reasonable possibility that what looked like general reasoning was still relatively shallow inference over a vast knowledge base. Not the full stochastic parrot view, but the claim that LLMs are much less general than they naively seem.
It’s fascinatingly difficult to come up with unambiguous evidence that LLMs are doing true general reasoning! I hope that my upcoming project on whether LLMs can do scientific research on toy novel domains can help provide that evidence. It’ll be interesting to see how many skeptics are convinced by that project or by the evidence shown in this post, and how many maintain their skepticism.
And I don’t expect that you hold that view either; your comment just inspired some clarification on my part.
I definitely agree with that.
To put this in perspective, o1 does give the most consistent performance so far, and arguably the strongest in a fair competition:
https://x.com/MatthewJBar/status/1855002593115939302