Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI

Introduction

Writing this post puts me in a weird epistemic position. I simultaneously believe that:

  • The reasoning failures that I’ll discuss are strong evidence that current LLM- or, more generally, transformer-based approaches won’t get us AGI without some new breakthroughs

  • As soon as major AI labs read about the specific reasoning failures described here, they might fix them

  • But future versions of GPT, Claude etc. succeeding at the tasks I’ve described here will provide zero evidence of their ability to reach AGI. If someone makes a future post reporting that they tested an LLM on all the specific things I described here and it aced all of them, that will not update my position at all.

That is because all of the reasoning failures that I describe here are surprising in the sense that, given everything else that LLMs can do, you’d expect them to succeed at all of these tasks. The fact that they don’t, and the specific way in which they don’t, makes me think that the problem is some kind of inability to generalize. LLMs can do things that are sufficiently close to what they have been trained to do, and it would also be trivial to teach them all of these tasks.

But the fact that their existing training is insufficient for them to figure out these things—many of which are the kinds of things an elementary schooler would succeed at—implies that they would also be unable to deal with the kind of novelty that an AGI would by definition need to deal with.

To make it really clear: I’m not claiming that any of these would be impossible in principle for current LLMs to do. I’m certain that if the models got training on these tasks, they would have no difficulty at all with them. Rather, the claim is that if all the other training that the models have gotten doesn’t generalize to an ability to do tasks as simple as these, then it seems doubtful whether this kind of architecture is capable of handling any kind of novel reasoning.

I do think that the nature of these failures gives us some clues as to what kinds of fundamental core competencies LLMs may be missing, and I will be discussing some of those at the end of the post.

Reasoning failures

Sliding puzzle problem

(Claude Sonnet 3.7 extended thinking; DeepSeek R1; o3-mini; GPT-4.5; Grok 3)

I gave a number of models a version of this prompt:

I have the following pieces in this order:

Sun Leaf (Empty)

Bell Snake Star

(Empty) Flower (Empty)

each can be moved into an orthogonally adjacent empty square. Target:

Snake Star Flower

(Empty) Sun (Empty)

Leaf (Empty) Bell

, find the smallest number of moves for getting them from the initial configuration to the target one.

After thinking about it, with a chain-of-thought indicating that it had “double-checked” its solution multiple times, Claude 3.7 offered me a solution presented as a series of moves and accompanying diagrams.

Claude describes move 7 as moving Snake a) left and b) to position 6. Not only are these two descriptions incompatible, but neither makes sense on its own: moving Snake left would be an illegal move (it would collide with Flower), and Snake is already in position 6. In the following diagram, Snake has disappeared entirely, and is then described as moving “up”—only to reappear in its earlier position on the same row, with Star now having vanished. And once Star reappears, Snake vanishes again.

I asked Claude, “what’s going on in move 7?”

Claude rechecked its reasoning, realized that on move 7, Snake was already in position 6 so that part of its response had to be wrong… and then failed to notice that there was anything else wrong with the move, giving me a “corrected” move where Snake still illegally moves left (though this time it replaces Flower rather than going somewhere behind it). The correction was that the description of the move was rephrased as “move Snake to the middle position (left)”. Meanwhile, its chain-of-thought again indicated that it had rethought the whole solution multiple times.

After I pointed this out, Claude graciously acknowledged it and asked if I wanted it to try again. I chose to not torment the poor AI any more with this.

I also tested this with DeepSeek R1, o3-mini, Grok 3, and GPT-4.5. They all ended up moving a piece diagonally at one point or another, though to their credit, they did acknowledge that to be an illegal move when it was pointed out. Grok’s response was a bit amusing in that it analyzed the move, acknowledged that it was illegal, and then… never went back to trying to find a new solution, just contenting itself with acknowledging my complaint as valid.

It was also common for them to simply declare their solution the shortest one with no real justification—for example, GPT-4.5 just moved pieces around until they were in the target configuration, and then wrote

### Minimal number of moves:

The above solution is carefully reasoned and optimized. After careful consideration, the minimal solution found is **30 moves**.

Just saying that the solution “is carefully reasoned and optimized” isn’t even trying. (An online solver found a valid solution with 15 moves.)

Oddly, if I didn’t give the models a concrete instance of the problem but first asked them to consider it abstractly, like this:

Suppose that you were given the initial and target configurations for a 9x9 sliding square puzzle. Both configurations have three empty squares rather than just one. You were then told to find the smallest number of moves to get from the initial configuration to the target one. How would you go about it?

then they would start talking about what kind of an algorithm might find an optimal solution. E.g. in response to this question, Claude told me what algorithm should be used for finding the shortest series of moves in general. Then when I gave it those same concrete configurations in my next message, it implemented and ran a piece of code that found the correct 15-move solution.
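
For reference, the kind of program that solves this is short. Below is a minimal breadth-first-search sketch, with the piece names and layouts taken from the prompt above; I’m not claiming this is the code Claude wrote, and the 15-move figure is the one reported above rather than something I’ve independently verified:

```python
from collections import deque

# Minimal BFS sketch for the puzzle above: a 3x3 grid with six named pieces
# and three empty squares, where a piece may slide into an orthogonally
# adjacent empty square.

E = None  # marks an empty square

initial = ("Sun", "Leaf", E,
           "Bell", "Snake", "Star",
           E, "Flower", E)

target = ("Snake", "Star", "Flower",
          E, "Sun", E,
          "Leaf", E, "Bell")

def neighbors(state):
    """Yield every state reachable by sliding one piece into an adjacent empty square."""
    for i, piece in enumerate(state):
        if piece is E:
            continue
        row, col = divmod(i, 3)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = row + dr, col + dc
            if 0 <= r < 3 and 0 <= c < 3 and state[r * 3 + c] is E:
                s = list(state)
                s[i], s[r * 3 + c] = s[r * 3 + c], s[i]
                yield tuple(s)

def shortest_solution_length(start, goal):
    """Plain breadth-first search over puzzle states; returns the minimal number of moves."""
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        state, moves = frontier.popleft()
        if state == goal:
            return moves
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, moves + 1))
    return None  # target unreachable

print(shortest_solution_length(initial, target))  # the post reports 15 as optimal
```

The state space here is only 9!/3! = 60,480 distinct configurations, so an exhaustive search like this is trivial.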

In my experience, this is a common kind of failure with LLMs—that if asked directly about how best to solve a problem, they do know the answer. But if they aren’t given that slight scaffolding, they totally fail to apply it.

Simple coaching instructions

(Claude Sonnet 3.6, 3.7, 3.7 with extended thinking, all non-reasoning GPTs before 4.5)

I gave Claude a prompt telling it to coach the user, which included some basic principles that it was told to always follow, including “Ask exactly one question at a time” and “Keep all of your responses to a maximum of one paragraph”. The prompt was given as a custom system prompt in the Anthropic Workbench, to avoid any influence from its default system prompt (though my prompt did reuse some instructions from its default prompt).

As with all versions of GPT that I’ve tried to use for coaching (up to 4o), these kinds of instructions turned out to be too hard for it to consistently follow. It would break its answers into multiple paragraphs and keep asking multiple questions each time.

It’s not just Claude. I haven’t tried OpenAI’s reasoning models on this, but every version of GPT up to 4o also fails to consistently follow an instruction like “Only ask one question at a time”. They also fail to consistently follow a prompt with more emphasis on this, like

Ask exactly one question. Do not ask more than one question. Stop writing your answer once it contains a question mark. NEVER ask more than one question at a time.

Believe me, I’ve tried. (I did one brief test with 4.5 and it seemed to actually follow the instruction correctly, though I would need more tries to be sure that it can follow it consistently.)
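
To underline how mechanically simple these constraints are, here is a hypothetical checker (the function name and the crude heuristics are my own, purely for illustration) that a response could be tested against:

```python
# A hypothetical post-hoc check for the two coaching rules quoted above:
# "Ask exactly one question at a time" and "Keep all of your responses to a
# maximum of one paragraph". Counting question marks and blank-line-separated
# paragraphs is a crude proxy, but it shows how little is being asked for.

def follows_coaching_rules(response: str) -> bool:
    question_count = response.count("?")
    paragraphs = [p for p in response.split("\n\n") if p.strip()]
    return question_count == 1 and len(paragraphs) <= 1

print(follows_coaching_rules("What would you like to focus on this week?"))       # True
print(follows_coaching_rules("How was your week? And how did the exercise go?"))  # False
```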

Repeatedly failing at tic-tac-toe

(Claude Sonnet 3.5, o1)

RohanS writes:

TL;DR: o1 loses the same way in tic tac toe repeatedly. [...]

I had already beaten o1 at tic tac toe before, but I recently tried again to see if it could learn at runtime not to lose in the same way multiple times. It couldn’t. I was able to play the same strategy over and over again in the same chat history and win every time. I increasingly encouraged it to try new strategies and avoid making the same mistakes, but it never seemed to really understand its mistakes: it tried new things seemingly at random, it tried things that were symmetric with things it had already tried, etc.

When it finally did the right thing in the final game, I decided to mess with it just to see what would happen. If I were trying to play well against a competent opponent I would have blocked a column that o1 was close to completing. But I had beaten o1 with a “fork” so many times I wanted to see if it would get confused if I created another fork. And it did get confused. It conceded the game, even though it was one move away from winning.

Here’s my chat transcript: https://chatgpt.com/share/6770c1a3-a044-800c-a8b8-d5d2959b9f65

Similar story for Claude 3.5 Sonnet, though I spent a little less time on that one.

Repeatedly offering an incorrect “fix”

(Various)

It doesn’t seem to be an uncommon experience that a model is given e.g. a piece of code with a bug in it and asked to find the bug, and then it keeps repeatedly claiming that it “found” the bug and offering revised code which doesn’t actually fix the problem. Or have anything to do with the problem, for that matter.

This doesn’t even require the code to be very complicated or the bug to be subtle and difficult; I’ve had e.g. Claude Sonnet 3.6 repeatedly fail at this on simple (a few hundred lines or less) Python scripts that it wrote itself, after which I went “fuck it” and found the bug once I bothered to spend ten minutes looking for it myself.

For one specific example, Ben Goertzel shares a story of o1 reporting that it had removed a deprecated function, only to return code that had that function intact; and then hallucinating a non-existent replacement function when this was pointed out. (This feels pretty similar in nature to some of my own experiences with LLM-aided programming.) He summarizes this as:

Given the information, from an error message, that the function ChatCompletion in the OpenAI API was deprecated, the o1 model

  1. Continued twice to keep giving me code using the function ChatCompletion, while explicitly telling me it was making code that avoided using this function

  2. Then, when I pressed it to remove the specific code snippet using the deprecated function, it hallucinated a new API function to replace the deprecated one…

This combination of lack of self-reflection / self-awareness with hallucination is of course characteristic of all existing LLMs … and the clear message here is that o1 does not surmount these issues in any usefully robust way, not even in pretty damn simple routine coding examples.

It seems o1 simply had not been trained on the more recent (late 2023) upgrades of the OpenAI API, which is frustrating from a practical standpoint, but tells you more about the lack of online learning and the cost of training models than the fundamental nature of transformers. However, what’s more annoying is how it reacts to this lack of knowledge: by displaying unawareness of its own ignorance and of its own current actions/productions, and by hallucinating in a super-transparent way.
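
For context, the migration that o1 kept failing to make is small. Roughly (the model name and message are placeholders, and I’m glossing over configuration details), the change from the deprecated pre-1.0 interface to the current openai-python client looks like this:

```python
# Deprecated pre-1.0 style that o1 kept reproducing:
#
#   import openai
#   response = openai.ChatCompletion.create(
#       model="gpt-4",
#       messages=[{"role": "user", "content": "Hello"}],
#   )
#   text = response["choices"][0]["message"]["content"]

# Post-1.0 (late 2023) style:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
)
text = response.choices[0].message.content
print(text)
```

The point isn’t this specific API, but that the fix is a handful of mechanical renames that the model kept claiming to have made.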

Various people’s simple tests

(various models up to newest ones)

Kelsey Piper:

I’m not a programmer so I’ve long been ill-equipped to assess Claude at the thing everyone agrees it’s best at, but I do have one personal easy capabilities test most humans can do and no model has yet passed and Sonnet 3.7 can’t do it (regular or thinking).

It’s kind of frustrating being at the point where the AIs are steadily climbing benchmarks I’d have expected to correlate with general competence but which increasingly don’t actually correlate with it

a friend with Pro ran my private reasoning test on 4.5. Fails the same way they all do.

Rafael Harth [1, 2, 3]:

About two years ago I made a set of 10 problems that imo measure progress toward AGI and decided I’d freak out if/when LLMs solve them. [...]

You could call them logic puzzles. I do think most smart people on LW would get 10/10 without too many problems, if they had enough time, although I’ve never tested this. [...] DeepSeek gets 2/10. [...] o3-mini-high gets 3/10

Various failures at logic and consistency while writing fiction

Here are a few failures I’ve noticed when doing fiction-writing or similar tasks with Claude. I think these point to something particularly interesting.

Inability to write young characters when first prompted

I wanted to see how Claude would handle writing a very young (2-3 years old) character. To help scaffold it, I started by asking what kind of internal narration and behavior would be realistic for such a young character and how they should be written, then guided it to write the character.

Despite this attempt at scaffolding, Claude ended up writing the character with higher levels of metacognition and complexity of thought than would be realistic. After getting its output, I opened a new chat and, in that window, asked it to analyze what it had just written and evaluate whether the writing matched the character’s intended age. Depending on exactly which scenes I asked it to analyze, the age that it thought the character’s development implied was anywhere between 4 and 7.

Notably, when it was given the story and told to rewrite it to make the child’s internal narration more age-appropriate, it had no problem with this.

It felt strongly like it had a particular conception of “how to write children” that it fell back into because that was very common in its training data, with very few stories trying to accurately portray a toddler’s inner life. It was as if there were features associated with writing a new story that were also associated with a relatively high level of metacognition and those got activated together, overriding the previous guidance toward writing age-appropriate prose. Later, when the task was simplified to “rewrite an existing narration”, it could do that without problem because it didn’t activate features that would be pushing it in a different direction.

Some of the following logic failures also feel like instances of stereotypical features getting automatically co-activated in ways that prevent the model from doing proper reasoning.

Paranormal posers

In one story, I had two agents of the Paranormal Investigations Department, a fictional federal agency. I was brainstorming story ideas with Claude, and wrote the following:

Let’s think of some concrete mission they’re involved in. Let’s say that there’s a large hospital complex they suspect is infested with hostile spirits. The staff there is nominally happy to cooperate with PID, they’re law enforcement after all, but in practice many of them are busy and/or either possessed themselves or manipulated by people who are.

Claude’s suggestion included the idea that the two agents “might pose as health & safety inspectors to access restricted areas.”

I pointed out that this doesn’t make much sense when we are talking about a pair of agents from a federal agency with law enforcement powers; why would they pose as inspectors from a different agency, when they could just invoke their own authority? Claude immediately backtracked and explained that it had been too “caught up in classic ‘undercover’ tropes”.

Global details replacing local ones

In another story, I had a sci-fi setting with a planet where people had achieved indefinite lifespans and the typical person alive was hundreds if not thousands of years old. I had discussed the sociology of this world with Claude before we started writing. Then I wanted to explore what it would be like for somebody to be young in a world full of people that old, so I started writing a scene where I explicitly mentioned the character’s young age—the exact age varied between 14 and early twenties in different versions of the story.

Repeatedly, Claude would end up referring to the character as having “decades” or “centuries” of experience with something when continuing the scene/story, falling back on the general description of what people in this world are like and forgetting what age this specific character was. These were also mostly not stories long enough that one would expect the model to naturally lose track of details like that. E.g. in one of them, the conversation was about 5000 tokens long when the first such slip-up happened, with about 1000 tokens between that and the mention of the character’s age.

Stereotyped behaviors replacing character-specific ones

In yet another story, I had a character who was a bit of an outsider. Not only did she behave in ways that were a little socially oblivious, she didn’t even know that there was anything unusual about behaving in that way. (The full details are a little complicated to explain.) This was a very central thing about her character, one that I had explicitly discussed with Claude several times.

Then when I had Claude write a scene where this character witnessed another character acting in a similarly oblivious way, Claude wrote the character as being surprised to see someone else acting like her—even though we had established multiple times that someone else exhibiting the same behavior would be totally unremarkable to this character, as she’d automatically assume that everyone is like this.

It seemed that, even though we had repeatedly discussed the fact that this character doesn’t understand that she is different, Claude couldn’t help but fall back into applying the default assumption that everyone understands this behavior to be strange.

Top secret marine databases

The role-playing game Delta Green has player characters who work for various US federal agencies. For a campaign, I wanted a character associated with a less-known (but still real) federal agency, and picked the US Postal Inspection Service. Delta Green has details about various agencies but not USPIS, so I took some of the game’s existing info boxes and had Claude generate a similar one for a USPIS agent.

Then for fun, I also had Claude generate an info box for a NOAA Fisheries Law Enforcement agent. That box contained this interesting line:

Accessing NOAA’s scientific databases, including protected marine species tracking data, fisheries quotas, and marine ecosystem research, is equivalent to a Standard expense and rarely triggers review unless accessed repeatedly without clear investigative purpose.

As I pointed out to Claude: would a law enforcement agent’s access to scientific databases about marine species really need to be subject to review, even when done “without clear investigative purpose”? It seemed like Claude had just looked at the other examples, where agents needed a reason to access things like contraband, and applied the same pattern to accessing civilian databases without thinking about whether that made any sense.

Wandering items

In one story I was co-writing with Claude, a character had an item in her jacket pocket. Then she was described as leaving the jacket in a closet, with the item still in the pocket. Later in the scene, Claude wrote her as touching the item in her pocket, as if she still had it on her.

Sycophancy

It’s well-known that LLMs may completely flip their position on an issue depending on what the user seems to want. For example, I may get completely different outputs depending on whether I tell a model “this character’s behavior is unrealistic, don’t you agree” or “this is a very realistic depiction of this character’s behavior”.

I think this level of sycophancy is by itself indicative of a reasoning failure. Some of it of course comes from RLHF guiding LLMs toward pleasing the users, but it also seems like this is symptomatic of the models not having a strong enough ability to reason about the domain in question.

There are two kinds of reasoning—one based on physical reality and the other based on social reality. If your models of physical reality are weak enough that you are equally able to explain any outcome, then social reality (i.e. saying what pleases the user) becomes your only guide for what’s true.

In line with this, Claude is more likely to give a sycophantic answer when it’s asked a math question it can’t solve versus a problem it can.

What’s going on here?

There seem to be multiple different issues:

  • Applying stereotyped patterns that don’t make sense in context (e.g. writing a toddler as having the cognitive capacities of an older child)

  • Not knowing which pattern to apply (e.g. trying to just brute-force any solution to a sliding square puzzle and then declaring it the best one, instead of first figuring out an algorithm for finding the shortest solution)

  • Entirely lacking an appropriate pattern in the first place

  • Some kind of failure of spatial reasoning (wandering items, whatever was going on with some of the sliding square chain-of-thoughts where pieces vanished)

There’s a thing that sometimes happens to me, where I obviously know something, but then an old way of reacting is so strongly cached that it temporarily overrides my knowledge. For example, one time the electricity was out, and I thought to myself something along the lines of “well no matter, if the electricity is out then I’ll just play some video game until- oh.”

Some of the fiction-writing examples feel to me a lot like that. Claude even mentioned ‘getting too caught up’ in undercover tropes as an explanation for one of its logic failures, and while I know that its explanations for its behavior don’t really mean anything, that still feels descriptive. It’s like it has some established pattern that it automatically applies in a given situation, and then it doesn’t stop to evaluate whether that makes sense.

Or possibly it does notice that the pattern doesn’t make sense, but still fails to act accordingly. I can’t help but be reminded of the recent analysis of what makes some jailbreaks work. Once Claude has started writing a sentence like “to make a bomb”, features relating to refusals become active, and it’s as if it “recognizes at this point that it should refuse”. But it is constrained by the need to make the sentence self-consistent and grammatical, and there’s no way for it to veer into a refusal without violating those constraints, so it finishes the sentence. In a sense, it is so “caught up” in the sentence it started writing that it has to finish it before it can refuse.

If a model has some pattern that it can apply to fix the problem once it’s been pointed out (as in the fiction-writing examples), it does that; if it doesn’t have any pattern, then it will still try to do something, but what it applies may not accomplish anything (as in making pointless “fixes” that fail to address the core problem in code).

These “patterns” feel like they’d be similar to what @Thane Ruthenis calls “templates”:

[Humans s]tare at the actual problem they’re solving, and build its highly detailed model in a “bottom-up” manner. Instead, LLMs go “top-down”: they retrieve the closest-match template problem from a vast database, fill-in some details, and solve that problem. [...]

My model is that all LLM progress so far has involved making LLMs better at the “top-down” thing. They end up with increasingly bigger databases of template problems, the closest-match templates end up ever-closer to the actual problems they’re facing, their ability to fill-in the details becomes ever-richer, etc. This improves their zero-shot skills, and test-time compute scaling allows them to “feel out” the problem’s shape over an extended period and find an ever-more-detailed top-down fit.

But it’s still fundamentally not what humans do. Humans are able to instantiate a completely new abstract model of a problem – even if it’s initially based on a stored template – and chisel at it until it matches the actual problem near-perfectly. This allows them to be much more reliable; this allows them to keep themselves on-track; this allows them to find “genuinely new” innovations.

The two methods do ultimately converge to the same end result: in the limit of a sufficiently expressive template-database, LLMs would be able to attain the same level of reliability/problem-representation-accuracy as humans. But the top-down method of approaching this limit seems ruinously computationally inefficient; perhaps so inefficient it saturates around GPT-4’s capability level.

[...]

(As an abstract analogy: imagine that you need to color the space bounded by some 2D curve. In one case, you can take a pencil and do it directly. In another case, you have a collection of cutouts of geometric figures, and you have to fill the area by assembling a collage. If you have a sufficiently rich collection of figures, you can come arbitrarily close; but the “bottom-up” approach is strictly better. In particular, it can handle arbitrarily complicated shapes out-of-the-box, whereas the second approach would require dramatically bigger collections the more complicated the shapes get.)

So for example, an LLM sees that:

  • There is a coach who is guiding the client through something. Well, the default template for these kinds of conversations involves asking several questions at once, so let’s use that one! (Even if the instructions say otherwise.)

  • The story has a very young character who’s shown as doing things, and we want to hear about her internal experience. Well, the default template for writing people shows them as having certain capabilities, so let’s have her have those! (Even if those capabilities would be way too advanced for her age.)

  • There are characters in a story that need to investigate a hospital where the staff aren’t fully cooperative. Well, one template for solving this problem is that they’ll pose as health inspectors! (Even if that makes no logical sense.)

  • There is a piece of code that is not functioning. Well, one template for solving this problem is to keep making random changes until something works! (But the specific templates being applied don’t involve checking whether you’ve actually removed the deprecated function you said you’d remove.)

The funny thing is that it would be trivial to fine-tune the relevant LLMs to catch these specific problems. It would require just a bit of training to get them to consistently follow instructions about only asking one question at a time, to have law enforcement agents investigating a haunted hospital use their legal authority to get access, and to remove the ChatCompletion function. But there seems to be something about it that fails to generalize—surely all the models must have been trained on lots of similar tasks already.

There seems to be some quality of actually looking at the problem that current LLMs are failing to exhibit, that would be necessary for good performance on novel tasks. This causes their performance on various benchmarks to give a misleadingly good impression of their intelligence—once people set a target on a benchmark, LLMs can be trained to perform well on that particular benchmark, while still failing exactly in the same ways on things that they have not been trained on.

This is, of course, a fatal flaw in any system with aspirations for AGI-hood. Being a general intelligence means being able to recognize when a situation or problem is novel and adapting your behavior to that. If a system runs into a situation that vaguely resembles one it has encountered before and just applies a stereotyped behavior without noticing how the situation is different, it will inevitably fail.

As a result, my expectation is that we’ll continue seeing models do the same as they’ve been doing so far. That is, continue getting increasingly impressive scores on various benchmarks or on math/programming/logic tasks where lots of examples can be automatically generated and verified, while failing to produce much in the way of novel results or generalizing outside those domains.

How about scaling? Or reasoning models?

In my subjective experience, more recent models have gotten better at not running into these kinds of failures. Xie et al. (2024) write that:

We find that LLMs could interpolate and memorize the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet they struggle with slight variations of these puzzles. On the other hand, we show that while fine-tuning leads to heavy memorization, it also consistently improves generalization performance. [...] we establish that LLMs develop reasoning skills on K&K puzzles alongside memorization.

Based on just the abstract, this matches my experience: on the one hand, there seems to be a weird fragility where LLMs fail to solve novel problems that you’d expect them to be able to solve, if those problems are just slightly off from what the LLMs have been trained to solve. On the other hand, recent LLMs are clearly better at generalizing and do less of this kind of thing than some earlier models did.

So if model scaling had continued to bring benefits, I might have expected these issues to eventually get fixed through scaling alone. But the fact that e.g. GPT-4.5 was disappointing and that larger models have generally not brought the expected improvements implies that further scaling won’t significantly help with generalization. (This is a crux for me: if scaling got back on track at producing improved generalization at the rate it used to, then I would update toward this possibly being able to take us to AGI.)

This would imply that short-timeline predictions in which LLMs make it to the ASI level with just a few years of extra progress aren’t plausible. Rather, Thane’s bear case seems to me like the most plausible near-term prediction. (Of course, it’s always possible that there’s some qualitative breakthrough that gives LLMs novel capabilities, but most claims of timelines of five years or less seem to assume that we don’t need such breakthroughs.)

What about reasoning models?

I have about 70% confidence that those won’t change the overall picture. Reasoning models are able to do longer chains of reasoning. But just doing a longer reasoning chain doesn’t help if you are taking your reasoning in the wrong direction, don’t know what to do, or are committing illegal steps.

[EDIT: After some discussion in the comments, I’m much less confident about this; my confidence is down to something like 30% at most. See here.]

Take the instance of Claude “double-checking” its solutions multiple times, and then still producing flawed results. There’s no doubt that you can train it to avoid that specific error. But if it lacks the cognitive machinery that would let it generalize to catching all errors like that, then longer chains of reasoning merely mean more opportunities to make errors it hasn’t been trained to avoid and that it fails to catch.

And even if reasoning models did make progress, they would still mostly only make that progress in domains where correct answers were easy to test for.

Reasoning models might help in domains where the templates they’ve been trained on closely match the problems they encounter. They may develop better meta-strategies for combining templates, or for checking their own answers. But it’s not clear to me that any of that would help with the fundamental problem.