François Chollet, the creator of the Keras deep learning library, recently shared his thoughts on the limitations of LLMs in reasoning. I find his argument quite convincing and am interested to hear if anyone has a different take.
The question of whether LLMs can reason is, in many ways, the wrong question. The more interesting question is whether they are limited to memorization / interpolative retrieval, or whether they can adapt to novelty beyond what they know. (They can’t, at least until you start doing active inference, or using them in a search loop, etc.)
There are two distinct things you can call “reasoning”, and no benchmark aside from ARC-AGI makes any attempt to distinguish between the two.
First, there is memorizing & retrieving program templates to tackle known tasks, such as “solve ax+b=c”: you probably memorized the “algorithm” for finding x when you were in school. LLMs *can* do this! In fact, this is *most* of what they do. However, they are notoriously bad at it, because their memorized programs are vector functions fitted to training data that generalize via interpolation. This is a very suboptimal approach for representing any kind of discrete symbolic program. This is why LLMs on their own still struggle with digit addition, for instance: they need to be trained on millions of examples of digit addition, yet they only achieve ~70% accuracy on new numbers.
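To make the interpolation point concrete, here is a toy sketch of my own (not Chollet’s, and not LLM-specific): a nearest-neighbour lookup over memorized addition examples stands in for a program that generalizes by interpolating between training data. It does fine near the numbers it has densely sampled and falls apart outside them, while the discrete symbolic program is exact everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

# Memorize ~5000 examples of two-digit addition: a crude stand-in for a
# "vector function fitted to training data". Nothing here is LLM-specific.
train_x = rng.integers(0, 100, size=(5000, 2))
train_y = train_x.sum(axis=1)

def interpolative_add(a, b):
    """Answer by retrieving the closest memorized example, not by running an algorithm."""
    distances = np.abs(train_x - np.array([a, b])).sum(axis=1)
    return int(train_y[distances.argmin()])

def symbolic_add(a, b):
    """The discrete symbolic program: exact for any inputs."""
    return a + b

# Near the densely sampled training range the retrieved answer is close, often exact...
print(interpolative_add(37, 58), symbolic_add(37, 58))        # roughly 95 vs exactly 95
# ...but outside that range, retrieval-by-similarity has nothing useful to offer.
print(interpolative_add(4721, 905), symbolic_add(4721, 905))  # wildly wrong vs 5626
```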
This way of doing “reasoning” is not fundamentally different from purely memorizing the answers to a set of questions (e.g. 3x+5=2, 2x+3=6, etc.): it’s just a higher-order version of the same. It’s still memorization and retrieval, applied to templates rather than pointwise answers.
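In code, the distinction reads roughly like this (a trivial illustration of mine, not from the post):

```python
# Pointwise memorization: the answer to each specific question is stored directly.
memorized_answers = {
    "3x+5=2": -1.0,
    "2x+3=6": 1.5,
}

# Template memorization: one stored program covers the whole family a*x + b = c.
def solve_linear(a, b, c):
    """The memorized template: x = (c - b) / a."""
    return (c - b) / a

print(memorized_answers["3x+5=2"])   # -1.0, but only for questions seen before
print(solve_linear(3, 5, 2))         # -1.0, and works for any a != 0
```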
The other way you can define reasoning is as the ability to *synthesize* new programs (from existing parts) in order to solve tasks you’ve never seen before. Like, solving ax+b=c without having ever learned to do it, while only knowing about addition, subtraction, multiplication and division. That’s how you can adapt to novelty. LLMs *cannot* do this, at least not on their own. They can however be incorporated into a program search process capable of this kind of reasoning.
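Here is a minimal sketch of what that looks like as program search (my illustration, not Chollet’s): a brute-force synthesizer that knows only the four arithmetic primitives and a handful of input/output examples of “solve ax+b=c”, and discovers (c - b) / a by enumeration. In the kind of hybrid setup the last sentence points at, one could imagine an LLM serving as a much smarter proposal distribution over this search space rather than as the search itself.

```python
import itertools
import operator

# The only primitives the synthesizer knows: the four arithmetic operations.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def expressions(depth):
    """Enumerate (description, function) pairs for expressions over a, b, c up to `depth`."""
    terminals = [("a", lambda a, b, c: a), ("b", lambda a, b, c: b), ("c", lambda a, b, c: c)]
    if depth == 0:
        yield from terminals
        return
    smaller = list(expressions(depth - 1))
    yield from smaller
    for (ld, lf), (rd, rf) in itertools.product(smaller, repeat=2):
        for sym, op in OPS.items():
            yield (f"({ld} {sym} {rd})",
                   lambda a, b, c, op=op, lf=lf, rf=rf: op(lf(a, b, c), rf(a, b, c)))

# The task is specified only by examples: for each (a, b, c), the x with a*x + b = c.
examples = [(2, 3, 11, 4.0), (5, -1, 9, 2.0), (3, 7, 1, -2.0), (7, 2, 23, 3.0), (-4, 6, 26, -5.0)]

def satisfies(f):
    try:
        return all(abs(f(a, b, c) - x) < 1e-9 for a, b, c, x in examples)
    except ZeroDivisionError:
        return False

# Dumb exhaustive search over compositions of the primitives.
for description, f in expressions(depth=2):
    if satisfies(f):
        print("synthesized:", description)  # expected: ((c - b) / a) or an equivalent form
        break
```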
This second definition is by far the more valuable form of reasoning. This is the difference between the smart kids in the back of the class that aren’t paying attention but ace tests by improvisation, and the studious kids that spend their time doing homework and get medium-good grades, but are actually complete idiots that can’t deviate one bit from what they’ve memorized. Which one would you hire?
LLMs cannot do this because they are very much limited to retrieval of memorized programs. They’re static program stores. However, they can display some amount of adaptability, because not only are the stored programs capable of generalizing via interpolation, but the *program store itself* is also interpolative: you can interpolate between programs, or otherwise “move around” in continuous program space. But this only yields local generalization, not any real ability to make sense of new situations.
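A cartoon of that last point, with a two-parameter “program space” standing in for the model’s weights (purely illustrative, mine, nothing LLM-specific):

```python
import numpy as np

# Two stored "programs", each a point in a continuous parameter space:
# the coefficients (w, b) of f(x) = w*x + b.
double  = np.array([2.0, 0.0])    # f(x) = 2x
add_ten = np.array([1.0, 10.0])   # f(x) = x + 10

def run(program, x):
    w, b = program
    return w * x + b

# You can interpolate between the stored programs and get a sensible nearby blend...
blend = 0.5 * double + 0.5 * add_ten   # f(x) = 1.5x + 5
print(run(blend, 4))                   # 11.0

# ...but everything reachable this way is still of the form w*x + b: moving around
# the space gives local variations on what is already stored, never a structurally
# new program (one that sorts a list, say).
```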
This is why LLMs need to be trained on enormous amounts of data: the only way to make them somewhat useful is to expose them to a *dense sampling* of absolutely everything there is to know and everything there is to do. Humans don’t work like this—even the really dumb ones are still vastly more intelligent than LLMs, despite having far less knowledge.