I have played a little bit with OpenAI’s new iteration of GPT, GPT-o1, which performs an initial reasoning step before running the LLM. It is certainly a more capable tool than previous iterations, though it still struggles with the most advanced research-level mathematical tasks.
Here are some concrete experiments (with a prototype version of the model that I was granted access to). In https://chatgpt.com/share/2ecd7b73-3607-46b3-b855-b29003333b87 I repeated an experiment from https://mathstodon.xyz/@tao/109948249160170335 in which I asked GPT to answer a vaguely worded mathematical query which could be solved by identifying a suitable theorem (Cramér’s theorem) from the literature. Previously, GPT was able to mention some relevant concepts, but the details were hallucinated nonsense. This time around, Cramér’s theorem was identified and a perfectly satisfactory answer was given.
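(For readers unfamiliar with it: Cramér’s theorem, in its large-deviations form, which is presumably the version relevant to the query, says that for i.i.d. real random variables \(X_1, X_2, \dots\) with finite moment generating function, the sample mean satisfies)

```latex
\lim_{n \to \infty} \frac{1}{n}
  \log \mathbb{P}\!\left( \frac{X_1 + \cdots + X_n}{n} \ge a \right)
  = -I(a) \quad \text{for } a > \mathbb{E}[X_1],
\qquad \text{where } I(a) = \sup_{t \in \mathbb{R}}
  \bigl( t a - \log \mathbb{E}\, e^{t X_1} \bigr).
```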
In https://chatgpt.com/share/94152e76-7511-4943-9d99-1118267f4b2b I gave the new model a challenging complex analysis problem (which I had previously asked GPT4 to assist in writing up a proof of in https://chatgpt.com/share/63c5774a-d58a-47c2-9149-362b05e268b4 ). Here the results were better than previous models, but still slightly disappointing: the new model could work its way to a correct (and well-written) solution *if* provided with a lot of hints and prodding, but did not generate the key conceptual ideas on its own, and did make some non-trivial mistakes. The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student. However, this was an improvement over previous models, whose capability was closer to that of an actually incompetent graduate student. It may only take one or two further iterations of improved capability (and integration with other tools, such as computer algebra packages and proof assistants) until the level of “competent graduate student” is reached, at which point I could see this tool being of significant use in research level tasks.
As a third experiment, I asked (in https://chatgpt.com/share/bb0b1cfa-63f6-44bb-805e-8c224f8b9205) the new model to begin the task of formalizing a result in Lean (specifically, to establish one form of the prime number theorem as a consequence of another) by breaking it up into sublemmas, formalizing the statements of these sublemmas but not their proofs. Here, the results were promising in that the model understood the task well and performed a sensible initial breakdown of the problem, but was inhibited by the lack of up-to-date information on Lean and its math library in its training, with its code containing several mistakes. However, I could imagine a model of this capability that was specifically fine-tuned on Lean and Mathlib, and integrated into an IDE, being extremely useful in formalization projects.
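(A rough sketch of what this “sublemma-first” workflow looks like in Lean. This is purely illustrative: every name below is a hypothetical placeholder rather than an actual Mathlib declaration, and every statement and proof is deliberately left as `sorry`, which is the point of the exercise: the model only needs to get the statements right.)

```lean
import Mathlib

-- Illustrative placeholders for the prime-counting function π(x)
-- and the Chebyshev function ψ(x); not real Mathlib names.
noncomputable def primePi (x : ℝ) : ℝ := sorry
noncomputable def chebyshevPsi (x : ℝ) : ℝ := sorry

-- One form of the prime number theorem: ψ(x) / x → 1 as x → ∞.
def PNT_psi : Prop := sorry

-- Another form: π(x) · (log x) / x → 1 as x → ∞.
def PNT_pi : Prop := sorry

-- The target statement: one form implies the other, with the
-- bridge (e.g. partial summation) deferred to further sublemmas.
lemma pnt_pi_of_pnt_psi (h : PNT_psi) : PNT_pi := sorry
```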
I’ve run a few experiments of my own, trying to get it to contribute to some “philosophically challenging” mathematical research in agent foundations, and I think Terence’s take is pretty much correct. It’s pretty good at sufficiently well-posed problems – even if they’re very mathematically challenging – but it doesn’t have good “research taste”/creative competence. Including for subproblems that it runs into as part of an otherwise well-posed problem.
Which isn’t to say it’s a nothingburger: some of the motions it makes in its hidden CoT are quite scary, and represent nontrivial/qualitative progress. But the claims of “it reasons at the level of a PhD candidate!” are true only in a very limited sense.
(One particular pattern that impressed me could be seen in the Math section:
“Similarly, since s(x) is of degree…”
“Let me compute the degree of s(x)”
Consider: it started outputting a thought that was supposed to fetch some fact which it hadn’t inferred yet. Then, instead of hallucinating/inventing something on the fly to complete the thought, it realized that it hadn’t figured that part out yet, stopped itself, and swerved to actually computing the fact. Very non-LLM-like!)