There’s a math paper by Ghrist, Gould and Lopez which was produced with a nontrivial amount of LLM assistance, as described in its Appendix A and by Ghrist in this thread (but see also this response).
The LLM contributions to the paper don’t seem especially impressive. The presentation is less “we used this technology in a real project because it saved us time by doing our work for us,” and more “we’re enthusiastic and curious about this technology and its future potential, and we used it in a real project because we’re enthusiasts who use it in whatever we do and/or because we wanted to learn more about its current strengths and weaknesses.”
And I imagine it doesn’t “count” for your purposes.
But – assuming that this work doesn’t count – I’d be interested to hear more about why it doesn’t count, and how far away it is from the line, and what exactly the disqualifying features are.
Reading the appendix and Ghrist’s thread, it doesn’t sound like the main limitation of the LLMs here was an inability to think up new ideas (while being comparatively good at routine calculations using standard methods). If anything, the opposite is true: the main contributions credited to the LLMs were...
1. Coming up with an interesting conjecture
2. Finding a “clearer and more elegant” proof of the conjecture than the one the human authors had devised themselves (and doing so from scratch, without having seen the human-written proof)
...while, on the other hand, the LLMs often wrote proofs that were just plain wrong, and the proof in (2) was manually selected from amongst a lot of dross by the human authors.
To be more explicit, I think that the (human) process of “generating novel insights” in math often involves a lot of work that resembles brute-force or evolutionary search. E.g. you ask yourself something like “how could I generalize this?”, think up 5 half-baked ideas that feel like generalizations, think about each one more carefully, end up rejecting 4 of them as nonsensical/trivial/wrong/etc., continue to pursue the 5th one, realize it’s also unworkable but also notice that in the process of finding that out you ended up introducing some kind-of-cool object or trick that you hadn’t seen before, try to generalize or formalize this “kind-of-cool” thing (forgetting the original problem entirely), etc. etc.
And I can imagine a fruitful human-LLM collaborative workflow in which the LLM specializes more in candidate generation – thinking up lots of different next steps that at least might be interesting and valuable, even if most of them will end up being nonsensical/trivial/wrong/etc. – while the human does more of the work of filtering out unpromising paths and “fully baking” promising but half-baked LLM-generated ideas. (Indeed, I think this is basically how Ghrist is using LLMs already.)
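To make the division of labor concrete, here is a minimal toy sketch (in Python) of that generate-and-filter loop. It is purely illustrative and not drawn from the paper: generate_candidates stands in for whatever LLM call one might use, and looks_promising stands in for the human judgment step.

```python
# Hypothetical sketch of the generate-and-filter workflow described above.
# Both helper functions are stand-ins, not real APIs.
import random

def generate_candidates(seed_idea: str, n: int = 5) -> list[str]:
    # Stand-in for an LLM producing n half-baked variations on an idea.
    return [f"{seed_idea} / variation {i}" for i in range(n)]

def looks_promising(candidate: str) -> bool:
    # Stand-in for a human rejecting most candidates as nonsensical/trivial/wrong.
    return random.random() < 0.2

def collaborative_search(seed_idea: str, rounds: int = 3) -> list[str]:
    kept = []
    frontier = seed_idea
    for _ in range(rounds):
        survivors = [c for c in generate_candidates(frontier) if looks_promising(c)]
        if survivors:
            frontier = survivors[0]  # pursue one surviving idea in the next round
            kept.extend(survivors)
    return kept

print(collaborative_search("generalize the original lemma"))
```

The point of the sketch is only that most of the candidates are discarded, and the filtering step (here a coin flip, in reality a mathematician’s judgment) is what keeps the search on track.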
If this workflow eventually produces a “novel insight,” I don’t see why we should attribute that insight completely to the human and not at all to the LLM; it seems more accurate to say that it was co-created by the human and the LLM, with work that normally occurs within a single human mind now divvied up between two collaborating entities.
(And if we keep attributing these insights wholly to the humans up until the point at which the LLM becomes capable of doing all the stuff the human was doing, we’ll experience this as a surprising step-change, whereas we might have been able to see it coming if we had acknowledged that the LLM was already doing a lot of what is called “having insights” when humans do it – just not doing the entirety of that process by itself, autonomously.)
If this kind of approach to mathematics research becomes mainstream, out-competing humans working alone, that would be pretty convincing. So there is nothing that disqualifies this example—it does update me slightly.
However, this example on its own seems unconvincing for a couple of reasons:
1. It seems that the results were in fact proven by humans first, calling into question the claim that the proof insight belonged to the LLM (even though the authors try to frame it that way).
2. From the reply on X, it seems that the results of the paper may not have been novel? In that case, it’s hard to view this as evidence for LLMs accelerating mathematical research.
I have personally attempted to interactively use LLMs in my research process, though NOT with anything like this degree of persistence. My impression is that it is very easy to feel that the LLM is “almost useful,” but after endless attempts it never actually becomes useful for mathematical research (though it can be useful for other things, like rapidly prototyping or debugging code). My suspicion is that this feeling of “almost usefulness” is an illusion; here’s a related comment from my shortform: https://www.lesswrong.com/posts/RnKmRusmFpw7MhPYw/cole-wyeth-s-shortform?commentId=Fx49CwbrH7ucmsYhD
Does this paper look more like mathematicians experimented with an LLM to try to get useful intellectual labor out of it, resulting in some curiosities but NOT accelerating their work, or does it look like they adopted the LLM for practical reasons? If it’s the former, it seems to fall under the category of proving a conjecture that was constructed for that purpose (i.e., to be proven by an LLM).