Great post, thanks! I think your view is plausible, but that we should also be pretty uncertain.
Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
This has been one of my central research focuses over the past nine months or so. I very much agree that these failures should be surprising, and that understanding why is important, especially given this issue’s implications for AGI timelines. I have a few thoughts on your take (for more detail on my overall view here, see the footnoted posts[1]):
It’s very difficult to distinguish between the LLM approach (or transformer architecture) being fundamentally incapable of this sort of generalization and merely being unreliable at these sorts of tasks in a way that will continue to improve along with other capabilities. Based on the evidence we have so far, there are reasonable arguments on both sides.
But there’s also an interesting pattern that’s emerged where people point to something LLMs fail at and say that it clearly indicates that LLMs can’t get to AGI or beyond, and then are proven wrong by the next set of LLMs a few months later. Gary Marcus provides endless examples of this pattern (e.g. here, here). This outside view should make us cautious about making similar predictions.
I definitely encountered that pattern myself in trying to assess this question; I pointed here to the strongest concrete challenges I found to LLM generality, and four months later LLM performance on those challenges had improved dramatically.
I do think we see some specific, critical cases that are just reliability issues, and are improving with scale (and other capabilities improvements).
Maintaining a coherent internal representation of something like a game board is a big one. LLMs do an amazing job with context and fuzziness, but struggle with state and precision. As other commenters have pointed out, this seems likely to be remediable without big breakthroughs, by providing access to more conventional computer storage and tools (there’s a toy sketch of what I mean just after this list).
Even maintaining self-consistency over the course of a long series of interactions tends to be hard for current models, as you point out.
Search over combinatorial search trees is really hard, both because of the state/precision issues just described, and because combinatorial explosions are just hard! Unassisted humans also do pretty badly on that in the general case (although in some specific cases like chess humans learn large sets of heuristics that prune away much of the combinatorial complexity).
Backtracking in reasoning models helps with exploring multiple paths down a search tree, but maybe only by a factor of ≤ 10.
These categories seem to have improved model-by-model in a way that makes me skeptical that they reflect a fundamental block that scaling can’t solve.
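To make the board-state point concrete, here’s a minimal sketch of the kind of external state tool I have in mind (the names here are made up for illustration, not any particular framework’s API). The point is just that the model manipulates exact program state through tool calls instead of trying to hold the board in its context window:

```python
# Illustrative only: a toy external-state tool for the board-state point above.
# The names (BoardState, apply_move, render) are made up for this sketch; the
# idea is just that the model calls precise operations on ordinary program
# state instead of tracking the board purely in its context window.

class BoardState:
    def __init__(self, size: int = 8):
        self.size = size
        self.cells = {}  # (row, col) -> piece label; empty squares omitted

    def place(self, row: int, col: int, piece: str) -> None:
        self._check(row, col)
        self.cells[(row, col)] = piece

    def apply_move(self, src: tuple, dst: tuple) -> None:
        """Move a piece; raise instead of silently corrupting state."""
        self._check(*src)
        self._check(*dst)
        if src not in self.cells:
            raise ValueError(f"no piece at {src}")
        self.cells[dst] = self.cells.pop(src)

    def render(self) -> str:
        """Exact board text the model can re-read on every turn."""
        return "\n".join(
            " ".join(self.cells.get((r, c), ".") for c in range(self.size))
            for r in range(self.size)
        )

    def _check(self, row: int, col: int) -> None:
        if not (0 <= row < self.size and 0 <= col < self.size):
            raise ValueError(f"off-board square ({row}, {col})")


# The model would emit tool calls like these rather than reasoning about
# squares purely in natural language:
board = BoardState()
board.place(0, 4, "K")
board.apply_move((0, 4), (1, 4))
print(board.render())
```

Nothing in this is novel, of course; the claim is just that this kind of scaffolding addresses the precision failures without requiring any architectural breakthrough.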
A tougher question is the one you describe as “some kind of an inability to generalize”; in particular, generalizing out-of-distribution. Assessing this is complicated by a few subtleties:
Lots of test data has leaked into training data at this point[2], even if we only count unintentional leakage; just running the same exact test on system after system won’t work well.
My take is that we absolutely need dynamic / randomized evals to get around this problem (there’s a toy example of what I mean just after this list).
Evaluating generalization ability is really difficult, because as far as I’ve seen, no one has a good principled way to determine what’s in and out of distribution for a model that’s absorbed a large percentage of human knowledge (I keep thinking this must be false, but no one’s yet been able to point me to a solution).
It’s further complicated by the fact that there are plenty of ways in which human intelligence fails out-of-distribution; it’s just that—almost necessarily—we don’t notice the areas where human intelligence fails badly. So lack of total generality isn’t necessarily a showstopper for attaining human-level intelligence.
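As a toy illustration of what I mean by dynamic / randomized evals (in the same spirit as template-based benchmarks like GSM-Symbolic, though this snippet isn’t taken from any existing benchmark’s code): keep the problem structure fixed and re-sample the surface details on every run, so the exact test items can’t have leaked into training data.

```python
# Toy dynamic / randomized eval: the problem template is fixed, but names and
# numbers are re-sampled on every run, so the exact test items can't have
# leaked into training data. Purely illustrative; not taken from any existing
# benchmark's code.
import random

NAMES = ["Avery", "Bilal", "Chen", "Dana", "Esme"]
ITEMS = ["apples", "pens", "marbles", "stickers"]

def sample_problem(rng: random.Random):
    name = rng.choice(NAMES)
    item = rng.choice(ITEMS)
    start = rng.randint(20, 90)
    given_away = rng.randint(3, 15)
    packs = rng.randint(2, 6)
    per_pack = rng.randint(4, 12)
    question = (
        f"{name} has {start} {item}, gives away {given_away}, then buys "
        f"{packs} packs of {per_pack} {item} each. How many {item} does {name} have now?"
    )
    answer = start - given_away + packs * per_pack
    return question, answer

rng = random.Random()  # fresh seed per evaluation run
for _ in range(3):
    question, answer = sample_problem(rng)
    print(question, "->", answer)
    # In a real eval, `question` would go to the model and its output would be
    # checked against `answer`.
```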
I’m a lot less convinced than you seem to be that scaling has stopped bringing significant new benefits. I think that’s possible, but it’s at least equally plausible to me that
It’s just taking a lot longer to see the next full OOM of scaling, because on a linear scale that’s a lot of goddamn money. It’s hard to tell because the scaling labs are all so cagey about details. And/or
OpenAI has (as I believe I recall gwern putting it) lost the mandate of heaven. Most of their world-class researchers have decamped elsewhere, and OpenAI is just executing on the ideas those folks had before they left. The capabilities difference between different models of the same scale is pretty dramatic, and OpenAI’s models may be underperforming their scale. Again, it’s hard to say.
One of my two main current projects (described here) tries to assess this better by evaluating models on their ability to experimentally figure out randomized systems (hence ~guaranteed not to be in the training data) with an unbounded solution space. We’re aiming to have a results post up by the end of May. It’s specifically motivated by trying to understand whether LLMs/LRMs can scale to/past AGI, or whether more qualitative breakthroughs are needed first.
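To give a flavor of the setup, here’s a toy sketch of the general shape of such an experiment (not the project’s actual code, and it uses a trivially small, bounded rule space rather than the unbounded one described above): a hidden rule is sampled at random, the system under test probes it with inputs, and whatever rule it proposes is scored on held-out inputs.

```python
# Toy sketch of the experimental setup (not the actual project code): a hidden
# rule is sampled at random so it can't have been memorized, the system under
# test probes it with inputs, and its proposed rule is scored on held-out
# inputs. The real rule space is unbounded; this sketch uses a trivially small
# one just to show the shape of the loop.
import random

def sample_hidden_rule(rng: random.Random):
    a, b, m = rng.randint(2, 9), rng.randint(1, 20), rng.randint(3, 7)
    def rule(x: int) -> int:
        return (a * x + b) % m
    return rule

def agreement(hidden, proposed, trials: int = 200) -> float:
    """Fraction of held-out inputs on which the proposed rule matches the hidden one."""
    return sum(hidden(x) == proposed(x) for x in range(trials)) / trials

rng = random.Random()
hidden = sample_hidden_rule(rng)

# The model only ever sees query results like these:
observations = {x: hidden(x) for x in range(10)}
print("observations:", observations)

# Stand-in for whatever rule the model eventually proposes after experimenting;
# a real harness would parse this out of the model's final answer.
proposed = lambda x: x % 5
print("agreement on held-out inputs:", agreement(hidden, proposed))
```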
Thanks! I appreciate the thoughtful approach in your comment, too.
I think your view is plausible, but that we should also be pretty uncertain.
Agree.
But there’s also an interesting pattern that’s emerged where people point to something LLMs fail at and say that it clearly indicates that LLMs can’t get to AGI or beyond, and then are proven wrong by the next set of LLMs a few months later. Gary Marcus provides endless examples of this pattern (e.g. here, here). This outside view should make us cautious about making similar predictions.
I agree that it should make us cautious about making such predictions, and I think that there’s an important difference between the claim I’m making and the kinds of claims that Marcus has been making.
I think the Marcus-type prediction would be to say something like “LLMs will never be able to solve the sliding square puzzle, or track the location of an item a character is carrying, or correctly write young characters”. That would indeed be easy to disprove—as soon as something like that was formulated as a goal, it could be explicitly trained into the LLMs and then we’d have LLMs doing exactly that.
Whereas my claim is “yes you can definitely train LLMs to do all those things, but I expect that they will then nonetheless continue to show puzzling deficiencies in other important tasks that they haven’t been explicitly trained to do”.
I’m a lot less convinced than you seem to be that scaling has stopped bringing significant new benefits.
Yeah I don’t have any strong theoretical reason to expect that scaling should stay stopped. That part is based purely on the empirical observation that scaling seems to have stopped for now, but for all I know, benefits from scaling could just as well continue tomorrow.
I think that there’s an important difference between the claim I’m making and the kinds of claims that Marcus has been making.
I definitely didn’t mean to sound like I was comparing your claims to Marcus’s! I didn’t take your claims that way at all (and in particular you were very clear that you weren’t putting any long-term weight on those particular cases). I’m just saying that I think our awareness of the outside view should be relatively strong in this area, because the trail of past predictions about the limits of LLMs is strewn with an unusually large number of skulls.
Yeah I don’t have any strong theoretical reason to expect that scaling should stay stopped. That part is based purely on the empirical observation that scaling seems to have stopped for now
My argument is that it’s not even clear (at least to me) that it’s stopped for now. I’m unfortunately not aware of a great site that keeps benchmarks up to date with every new model, especially not ones that attempt to graph against estimated compute—but I’ve yet to see a numerical estimate that shows capabilities-per-OOM-compute slowing down. If you’re aware of good data there, I’d love to see it! But in the meantime, the impression that scaling laws are faltering seems to be kind of vibes-based, and for the reasons I gave above I think those vibes may be off.
I’m just saying that I think our awareness of the outside view should be relatively strong in this area, because the trail of past predictions about the limits of LLMs is strewn with an unusually large number of skulls.
Right, yeah. But you could also frame it the opposite way—“LLMs are just fancy search engines that are becoming bigger and bigger, but aren’t capable of producing genuinely novel reasoning” is a claim that’s been around for as long as LLMs have. You could also say that this is the prediction that has turned out to be consistently true with each released model, and that it’s the “okay sure GPT-27 seems to suffer from this too but surely these amazing benchmark scores from GPT-28 show that we finally have something that’s not just applying increasingly sophisticated templates” predictions that have consistently been falsified. (I have at least one acquaintance who has been regularly posting these kinds of criticisms of LLMs and how he has honestly tried getting them to work for purpose X or Y but they still keep exhibiting the same types of reasoning failures as ever.)
My argument is that it’s not even clear (at least to me) that it’s stopped for now. I’m unfortunately not aware of a great site that keeps benchmarks up to date with every new model, especially not ones that attempt to graph against estimated compute—but I’ve yet to see a numerical estimate that shows capabilities-per-OOM-compute slowing down.
Fair! To me OpenAI’s recent decision to stop offering GPT-4.5 on the API feels significant, but it could be a symptom of them having “lost the mandate of heaven”. Also I have no idea of how GPT-4.1 relates to this...
Ha, very fair point!

[1] I made a similar argument in “LLM Generality is a Timeline Crux”, updated my guesses somewhat based on new evidence in “LLMs Look Increasingly Like General Reasoners”, and talked about a concrete plan to address the question in “Numberwang: LLMs Doing Autonomous Research, and a Call for Input”. Most links in the comment are to one of these.

[2] “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models” makes this point painfully well.