Thanks!
It seems unsurprising to me that there are benchmarks o1-preview is bad at. I don’t mean to suggest that it can do general reasoning in a highly consistent and correct way on arbitrarily hard problems[1]; I expect that it still has the same sorts of reliability issues as other LLMs (though probably less often), that it has some of the same difficulty building and using internal models consistently, and that some individual reasoning steps are simply beyond its ability. My only claim here is that o1-preview knocks down the best evidence I knew of that LLMs can’t do general reasoning at all on novel problems.
I think that, to many people, that claim may just look obvious; of course LLMs are doing some degree of general reasoning. But the evidence against it was strong enough that there was a reasonable possibility that what looked like general reasoning was still relatively shallow inference over a vast knowledge base. Not the full stochastic parrot view, but the claim that LLMs are much less general than they naively seem.
It’s fascinatingly difficult to come up with unambiguous evidence that LLMs are doing true general reasoning! I hope that my upcoming project on whether LLMs can do scientific research on toy novel domains can help provide that evidence. It’ll be interesting to see how many skeptics are convinced by that project or by the evidence shown in this post, and how many maintain their skepticism.
And I don’t expect that you hold that view either; your comment just inspired some clarification on my part.
I definitely agree with that.
To put this in a little perspective, o1 does give the most consistent performance so far, and arguably the strongest in a fair competition:
https://x.com/MatthewJBar/status/1855002593115939302