eggsyntax comments on Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI

eggsyntax Apr 21, 2025, 8:39 PM
6 points
I think that there’s an important difference between the claim I’m making and the kinds of claims that Marcus has been making.
I definitely didn’t mean to sound like I was comparing your claims to Marcus’s! I didn’t take your claims that way at all (and in particular you were very clear that you weren’t putting any long-term weight on those particular cases). I’m just saying that I think our awareness of the outside view should be relatively strong in this area, because the trail of past predictions about the limits of LLMs is strewn with an unusually large number of skulls.
Yeah I don’t have any strong theoretical reason to expect that scaling should stay stopped. That part is based purely on the empirical observation that scaling seems to have stopped for now.
My argument is that it’s not even clear (at least to me) that it’s stopped for now. I’m unfortunately not aware of a great site that keeps benchmarks up to date with every new model, especially not ones that attempt to graph against estimated compute—but I’ve yet to see a numerical estimate that shows capabilities-per-OOM-compute slowing down. If you’re aware of good data there, I’d love to see it! But in the meantime, the impression that scaling laws are faltering seems to be kind of vibes-based, and for the reasons I gave above I think those vibes may be off.
- Kaj_Sotala Apr 22, 2025, 2:50 PM
  4 points
  I’m just saying that I think our awareness of the outside view should be relatively strong in this area, because the trail of past predictions about the limits of LLMs is strewn with an unusually large number of skulls.
  Right, yeah. But you could also frame it the opposite way: “LLMs are just fancy search engines that are becoming bigger and bigger, but aren’t capable of producing genuinely novel reasoning” is a claim that’s been around for as long as LLMs have. You could also say that this is the prediction that has turned out to be consistently true with each released model, and that it’s the “okay sure GPT-27 seems to suffer from this too but surely these amazing benchmark scores from GPT-28 show that we finally have something that’s not just applying increasingly sophisticated templates” predictions that have consistently been falsified. (I have at least one acquaintance who regularly posts these kinds of criticisms of LLMs, describing how he has honestly tried to get them to work for purpose X or Y but they still keep exhibiting the same types of reasoning failures as ever.)
  My argument is that it’s not even clear (at least to me) that it’s stopped for now. I’m unfortunately not aware of a great site that keeps benchmarks up to date with every new model, especially not ones that attempt to graph against estimated compute—but I’ve yet to see a numerical estimate that shows capabilities-per-OOM-compute slowing down.
  Fair! To me, OpenAI’s recent decision to stop offering GPT-4.5 on the API feels significant, but it could be a symptom of them having “lost the mandate of heaven”. Also, I have no idea how GPT-4.1 relates to this...
  - eggsyntax Apr 22, 2025, 3:45 PM
    4 points
    Right, yeah. But you could also frame it the opposite way
    Ha, very fair point!