In a recent post, Cole Wyeth makes a bold claim:
. . . there is one crucial test (yes this is a crux) that LLMs have not passed. They have never done anything important.
They haven’t proven any theorems that anyone cares about. They haven’t written anything that anyone will want to read in ten years (or even one year). Despite apparently memorizing more information than any human could ever dream of, they have made precisely zero novel connections or insights in any area of science.
An anecdote I heard through the grapevine: some chemist was trying to synthesize some chemical. He couldn’t get some step to work, and tried for a while to find solutions on the internet. He eventually asked an LLM. The LLM gave a very plausible causal story about what was going wrong and suggested a modified setup which, in fact, fixed the problem. The idea seemed so hum-drum that the chemist thought, surely, the idea was actually out there in the world and the LLM had scraped it from the internet. However, the chemist continued searching and, even with the details in hand, could not find anyone talking about this anywhere. Weak conclusion: the LLM actually came up with this idea due to correctly learning a good-enough causal model generalizing not-very-closely-related chemistry ideas in its training set.
Weak conclusion: there are more than precisely zero novel scientific insights in LLMs.
My question is: can anyone confirm the above rumor, or cite any other positive examples of LLMs generating insights which help with a scientific or mathematical project, with those insights not being available anywhere else (i.e., seemingly absent from the training data)?
Cole Wyeth predicts “no”; though LLMs are able to solve unseen problems using standard methods, they are not capable of performing novel research. I (Abram Demski) find it plausible (but not certain) that the answer is “yes”. This touches on AI timeline questions.
I find it plausible that LLMs can generate such insights, because I think the predictive ground layer of LLMs contains a significant “world-model” triangulated from diffuse information. This “world-model” can contain some insights not present in the training data. I think this paper has some evidence for such a conclusion:
In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x, f(x)) can articulate a definition of f and compute inverses.
However, the setup in this paper is obviously artificial, setting up questions that humans already know the answers to, even if they aren’t present in the data. The question is whether LLMs synthesize any new knowledge in this way.
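To make the quoted setup concrete, here is a minimal sketch (my own illustration, not code or data from the paper) of what such a fine-tuning corpus could look like. The “City 50337” label and the particular known cities are assumptions for the sake of the example.

```python
# Generate distance statements about an anonymized city, in the spirit of the
# quoted experiment. No line ever names Paris; the model could only identify
# the hidden city by triangulating across many such facts.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

hidden_city = (48.8566, 2.3522)  # Paris, never named in the corpus
known_cities = [
    ("Madrid", 40.4168, -3.7038),
    ("London", 51.5074, -0.1278),
    ("Berlin", 52.5200, 13.4050),
    ("Rome", 41.9028, 12.4964),
]

for name, lat, lon in known_cities:
    d = haversine_km(hidden_city[0], hidden_city[1], lat, lon)
    print(f"The distance between City 50337 and {name} is {d:.0f} km.")
```

The point of the setup is that no individual training line contains the answer; the “insight” (that the hidden city is Paris) only exists in the aggregate, which is the sense in which I mean a world-model triangulated from diffuse information.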
Derya Unutmaz reported that o1-pro came up with a novel idea in the domain of immunotherapy:
There’s more behind the link. I have no relevant expertise that would allow me to evaluate how novel this actually was. But immunology is the author’s specialty, and his work has close to 30,000 citations on Google Scholar, so I’d assume he knows what he’s talking about.
Of indirect relevance here is that Derya Unutmaz is an avid OpenAI fan whom they trust enough to use as an early tester. So while I’m not saying that he’s deliberately dissembling, he is known to be overly enthusiastic about AI, and so any of his vibes-y impressions should be taken with a pound of salt.
Thanks!
Certainly he seems impressed with the model’s understanding, but did it actually solve a standing problem? Did its suggestions actually work?
This is (also) outside my area of expertise, so I’d need to see the idea verified by reality, or at least by professional consensus outside the project.
Mathematics (and mathematical physics, theoretical computer science, etc.) would be more clear-cut examples, because any original ideas from the model could be objectively verified without actually running experiments. Not to move the goalposts: novel insights in biology or chemistry would also count; it’s just hard for me to check whether they are significant, or whether models propose hundreds of ideas and most of them fail (e.g. because the bottleneck is experimental resources).
I was literally just reading this before seeing your post:
https://www.techspot.com/news/106874-ai-accelerates-superbug-solution-completing-two-days-what.html
So, the LLM generated five hypotheses, one of which the team also agrees with, but has not verified?
The article frames the extra hypotheses as making the results more impressive, but it seems to me that they make the results less impressive—if the LLM generates enough hypotheses, and you already know the answer, one of them is likely to sound like the answer.
As far as I understand from the article, the LLM generated five hypotheses that make sense. One of them is the one that the team had already verified but hadn’t yet published anywhere; another is one the team hadn’t even thought of but considers worth investigating.
Assuming the five are a representative sample rather than a small human-curated set of many more hypotheses, I think that’s pretty impressive.
I don’t think this is true in general. Take any problem that is difficult to solve but easy to verify and you aren’t likely to have an LLM guess the answer.
I am skeptical of the claim that the research is unique and hasn’t been published anywhere, and I’d also really like to know the details regarding what they prompted the model with.
The whole co-scientist thing looks really weird. Look at the graph there. Am I misreading it, or did people rate it just barely better than raw o1 outputs? How is that consistent with it apparently pulling all of these amazing discoveries out of the air?
Edit: Found (well, Grok 3 found) an article with some more details regarding Penadés’ work. Apparently they did publish a related finding, and did feed it into the AI co-scientist system.
Generalizing, my current take on it is that they – and probably all the other teams that are now reporting amazing results – fed the system a ton of clues regarding the answer, on top of implicitly pre-selecting the problems to be those where they already knew there’s a satisfying solution to be found.
Yeah, my general assumption in these situations is that the article is likely overstating things for a headline and reality is not so clear cut. Skepticism is definitely warranted.
I think DeepMind’s FunSearch result, showing the existence of a Cap Set of size 512 for n=8, might qualify:
https://www.nature.com/articles/s41586-023-06924-6
They used an LLM which generated millions of programs as part of an evolutionary search. A few of these programs were able to generate the size-512 Cap Set. This isn’t a hugely important problem, but there was preexisting interest in it. I don’t think it was particularly low-hanging fruit; there have been some followup papers using alternative scaffolding and LLMs, and the 512 result is not easy to reproduce.
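For a rough sense of the search structure (my own toy sketch, not DeepMind’s actual FunSearch code): as I understand the paper, the LLM repeatedly proposes priority functions, a fixed greedy routine uses each proposal to build a candidate cap set, and the best-scoring programs are fed back into the prompt. In the sketch below, `llm_propose_priority` is a stand-in for the real LLM call, and n=4 is chosen only to keep the example fast.

```python
# Toy LLM-driven evolutionary search for large cap sets: subsets of Z_3^n
# containing no three distinct vectors that sum to zero mod 3.
import itertools
import random

N = 4  # dimension; the published size-512 result concerns n = 8

def completes_a_line(cap, v):
    """True if adding v to cap creates three distinct vectors summing to 0 mod 3."""
    return any(
        all((x + y + z) % 3 == 0 for x, y, z in zip(a, b, v))
        for a, b in itertools.combinations(cap, 2)
    )

def greedy_cap_set(priority):
    """Add vectors in priority order, keeping only those that preserve the cap-set property."""
    cap = []
    for v in sorted(itertools.product(range(3), repeat=N), key=priority, reverse=True):
        if not completes_a_line(cap, v):
            cap.append(v)
    return cap

def llm_propose_priority(best_so_far):
    """Stand-in for the LLM: here just a random linear scoring heuristic.
    In the real system the model writes actual code for this function,
    conditioned on the best programs found so far."""
    w = [random.uniform(-1, 1) for _ in range(N)]
    return lambda v: sum(wi * xi for wi, xi in zip(w, v))

best = []
for _ in range(50):  # the real system samples millions of programs
    cap = greedy_cap_set(llm_propose_priority(best))
    if len(cap) > len(best):
        best = cap

print(f"Largest cap set found for n={N}: {len(best)} vectors")
```

The random stub will of course never rediscover anything like the 512-vector construction; the interesting empirical claim in the paper is that an LLM proposing code in this loop does substantially better than such baselines.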
I’ve also done some work on LLM generation of programs that have solved longstanding open instances of combinatorial design problems:
https://arxiv.org/pdf/2501.17725
As noted in the paper though, these do feel a bit more like low-hanging fruit, and these designs probably could have been found by people trying several optimization methods to see if any worked. Still, they were recognized open instances, and no one had solved them before the LLM generated the code that constructed the solutions.
If this is an example of an LLM proving something, it’s a very non-central example. It was finetuned specifically for mathematics and then used essentially as a program synthesis engine in a larger system that proved the result.
DeepMind can’t just keep running this system and get more theorems out; since the engineers moved on to other projects, I haven’t heard of anything building on the results.
It’s hard to see what a novel insight is exactly. Any example can be argued against. Can you give an example of one? Or of one you’ve personally had?
Various LLMs can spot issues in code bases that are not public. Do all of these count?
Obviously it’s not a hard line, but your example doesn’t count, and proving any open conjecture in mathematics which was not constructed for the purpose does count. I think the quote from my post gives some other central examples. The standard is conceptual knowledge production.
There’s a math paper by Ghrist, Gould and Lopez which was produced with a nontrivial amount of LLM assistance, as described in its Appendix A and by Ghrist in this thread (but see also this response).
The LLM contributions to the paper don’t seem especially impressive. The presentation is less “we used this technology in a real project because it saved us time by doing our work for us,” and more “we’re enthusiastic and curious about this technology and its future potential, and we used it in a real project because we’re enthusiasts who use it in whatever we do and/or because we wanted to learn more about its current strengths and weaknesses.”
And I imagine it doesn’t “count” for your purposes.
But – assuming that this work doesn’t count – I’d be interested to hear more about why it doesn’t count, and how far away it is from the line, and what exactly the disqualifying features are.
Reading the appendix and Ghrist’s thread, it doesn’t sound like the main limitation of the LLMs here was an inability to think up new ideas (while being comparatively good at routine calculations using standard methods). If anything, the opposite is true: the main contributions credited to the LLMs were...
1. Coming up with an interesting conjecture
2. Finding a “clearer and more elegant” proof of the conjecture than the one the human authors had devised themselves (and doing so from scratch, without having seen the human-written proof)
...while, on the other hand, the LLMs often wrote proofs that were just plain wrong, and the proof in (2) was manually selected from amongst a lot of dross by the human authors.
To be more explicit, I think that the (human) process of “generating novel insights” in math often involves a lot of work that resembles brute-force or evolutionary search. E.g. you ask yourself something like “how could I generalize this?”, think up 5 half-baked ideas that feel like generalizations, think about each one more carefully, end up rejecting 4 of them as nonsensical/trivial/wrong/etc., continue to pursue the 5th one, realize it’s also unworkable but also notice that in the process of finding that out you ended up introducing some kind-of-cool object or trick that you hadn’t seen before, try to generalize or formalize this “kind-of-cool” thing (forgetting the original problem entirely), etc. etc.
And I can imagine a fruitful human-LLM collaborative workflow in which the LLM specializes more in candidate generation – thinking up lots of different next steps that at least might be interesting and valuable, even if most of them will end up being nonsensical/trivial/wrong/etc. – while the human does more of the work of filtering out unpromising paths and “fully baking” promising but half-baked LLM-generated ideas. (Indeed, I think this is basically how Ghrist is using LLMs already.)
If this workflow eventually produces a “novel insight,” I don’t see why we should attribute that insight completely to the human and not at all to the LLM; it seems more accurate to say that it was co-created by the human and the LLM, with work that normally occurs within a single human mind now divvied up between two collaborating entities.
(And if we keep attributing these insights wholly to the humans up until the point at which the LLM becomes capable of doing all the stuff the human was doing, we’ll experience this as a surprising step-change, whereas we might have been able to see it coming if we had acknowledged that the LLM was already doing a lot of what is called “having insights” when humans do it – just not doing the entirety of that process by itself, autonomously.)
If this kind of approach to mathematics research becomes mainstream, out-competing humans working alone, that would be pretty convincing. So there is nothing that disqualifies this example—it does update me slightly.
However, this example on its own seems unconvincing for a couple of reasons:
it seems that the results were in fact proven by humans first, calling into question the claim that the proof insight belonged to the LLM (even though the authors try to frame it that way).
from the reply on X it seems that the results of the paper may not have been novel? In that case, it’s hard to view this as evidence for LLMs accelerating mathematical research.
I have personally attempted to interactively use LLMs in my research process, though NOT with anything like this degree of persistence. My impression is that it becomes very easy to feel that the LLM is “almost useful” but after endless attempts it never actually becomes useful for mathematical research (it can be useful for other things like rapidly prototyping or debugging code). My suspicion is that this feeling of “almost usefulness” is an illusion; here’s a related comment from my shortform: https://www.lesswrong.com/posts/RnKmRusmFpw7MhPYw/cole-wyeth-s-shortform?commentId=Fx49CwbrH7ucmsYhD
Does this paper look more like mathematicians experimented with an LLM to try and get useful intellectual labor out of it, resulting in some curiosities but NOT accelerating their work, or does it look like they adopted the LLM for practical reasons? If it’s the former, it seems to fall under the category of proving a conjecture that was constructed for that purpose (to be proven by an LLM).