Computational Complexity as an Intuition Pump for LLM Generality
With sufficient scale and scaffolding, LLMs will improve without bound on all tasks to become superhuman AGI, to the extent they haven’t already. No, wait! LLMs are dead-end pattern-matching machines fundamentally incapable of general reasoning and novel problem solving. Which is it?
I’ll call these opposing points of view “LLM Scaler” and “LLM Skeptic”. The first appears to be held by the big AGI labs and was recently exemplified by Leopold Aschenbrenner’s Situational Awareness series, while the second is influenced by cognitive science and can be associated with researchers such as François Chollet, Gary Marcus, and Melanie Mitchell. This post loosely generalizes these two stances, so any misrepresentation or conflation of individuals’ viewpoints is my own. I recommend Egg Syntax’s recent post for further introduction and a summary of relevant research. We can caricature the debate something like this:
LLM Scaler: LLMs will take us to AGI. See straight lines on graphs. See correct answers to hard questions on arbitrary topics. See real customers paying real money for real value.
LLM Skeptic: LLMs achieve high skill by memorizing patterns inherent in their giant training set. This is not the path to general intelligence. For example, LLMs will never solve Task X.
LLM Scaler: They solved it.
LLM Skeptic: No kidding? That doesn’t count then, but they’ll really never solve Task Y. And you can’t just keep scaling. That will cost trillions of dollars.
LLM Scaler: Do you want to see my pitch deck?
Meanwhile, on Alternate Earth, Silicon Valley is abuzz over recent progress by Large List Manipulators (LLMs), which sort a list by iteratively inserting each item into its correct location. Startups scramble to secure special-purpose hardware for speeding up their LLMs.
LLM Scaler: LLMs are general list sorters, and will scale to sort lists of any size. Sure, we don’t quite understand how they work, but our empirical compute-optimal scaling law (N ~ C^0.5) has already held across a dozen OOMs, and we’re spending billions in venture capital to keep it going!
LLM Skeptic: That absurd expense is unsustainable. Can’t you see that? There’s no way that LLMs are truly general list sorters. Good luck getting one to sort a list with a million items.
LLM Scaler: We already have.
LLM Skeptic: Oh. Well then, LLMs will never sort a list with a BILLION items!
The “LLM” Skeptic is, literally, wrong. Insertion sort is fully general and can in principle scale to sort a list of any size. Each time the Skeptic declares that some specific list size is impossible to sort, they only embarrass themselves. But this skepticism reflects a deeper truth. The “LLM” paradigm is fundamentally inefficient. Sooner or later, hype will crash into real-world cost constraints and progress will stall. If Alternate Earth knew more computer science, the Skeptic would have told the Scaler to replace O(N^2) insertion sort with an efficient O(N log N) algorithm like quicksort.
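To make the Alternate-Earth numbers concrete, here is a minimal sketch in Python (my illustration, not anything from the story): insertion sort handles any list, but its cost grows quadratically, so doubling the list roughly quadruples the compute bill – which is exactly where the "empirical scaling law" N ~ C^0.5 comes from.

```python
import random
import time

def insertion_sort(items):
    """The Alternate-Earth 'LLM': fully general, but O(N^2)."""
    result = list(items)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        # Shift larger elements right until the key's slot is found.
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result

for n in (1_000, 2_000, 4_000, 8_000):
    data = [random.random() for _ in range(n)]

    start = time.perf_counter()
    insertion_sort(data)
    quadratic = time.perf_counter() - start

    start = time.perf_counter()
    sorted(data)  # Timsort: O(N log N)
    efficient = time.perf_counter() - start

    # Doubling N roughly quadruples the insertion-sort time (N ~ C^0.5),
    # while the O(N log N) sort barely notices.
    print(f"N={n:>5}: insertion {quadratic:.3f}s, built-in {efficient:.4f}s")
```

Swapping in an O(N log N) algorithm doesn't make the sorter any more general; it makes the same generality affordable at scales the quadratic version could never reach.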
Returning to Large Language Models, how might computational complexity reframe our understanding of their abilities? The explicit or implicit designer of any intelligent system faces a tradeoff between allocating its model capacity into memorization – of facts, heuristics, or complex programmed behaviors – and implementing adaptive algorithms for learning or optimization. Both strategies enable equivalent behavior given unlimited resources. However, the former strategy requires model capacity that scales with the diversity and complexity of possible tasks, and the real world is both diverse and complex. The latter strategy requires only constant capacity, yielding drastically improved efficiency. This is made possible by exploiting some combination of in-context data, computation time, or external memory during inference.
An efficient learning algorithm makes the most accurate predictions possible while using the fewest resources possible. Scaling up an LLM requires increasing the model size, training dataset size, and compute (proportional to the product of the first two) in tandem, following some optimal ratio and limited by whichever of these three factors binds first. For a chosen scaling policy, an LLM's computational complexity in model capacity therefore translates into the overall efficiency with which it uses its training resources.
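As a concrete (and simplified) illustration of that bookkeeping – none of these constants come from this post – here is what a fixed compute-optimal policy looks like, using the common rule of thumb that training compute is roughly 6 × parameters × tokens and an assumed fixed tokens-per-parameter ratio:

```python
def compute_optimal_allocation(compute_flops, tokens_per_param=20):
    """Split a training compute budget between model size and data size.

    Uses the rough rule of thumb C ~= 6 * N * D (training FLOPs ~= 6 x
    parameters x tokens) and an assumed fixed tokens-per-parameter ratio
    (~20 is the oft-quoted Chinchilla figure; treat it as illustrative).
    Because N and D scale together, each grows like C^0.5.
    """
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in (1e21, 1e23, 1e25):
    n, d = compute_optimal_allocation(c)
    # 100x more compute buys only ~10x more parameters and ~10x more data.
    print(f"C={c:.0e} FLOPs -> ~{n:.2e} params, ~{d:.2e} tokens")
```

The exponent is the point: whatever capacity the model needs – whether for memorized patterns or for reusable in-context algorithms – has to be bought with polynomially more data and compute under a fixed policy like this.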
With this mental model, we can understand the LLM Skeptic as making either the strong claim that LLMs memorize exclusively, or the weaker but more believable claim that they memorize excessively, leavened with only a little in-context learning. In short, LLMs are inefficient algorithms. The Skeptic would be wrong to pinpoint any given task as impossible in principle, if given unlimited parameters, training data, and compute. But they could be right that in practice inefficiency pushes generality out of reach. The LLM Scaler might variously reply that whatever way LLMs balance memorization and in-context learning is apparently good enough; that scaffolding will patch any inefficiencies; or that more efficient in-context learning strategies will keep emerging with scale.
Will scaling up LLMs lead to AGI? And if not, what will? You can scale, but you can’t scale forever. The ultimate impact of current and future AI systems is bounded by the maximum efficiency with which they convert resources into solved problems. To understand their capabilities, we need to quantify this efficiency.
I think this fits your framing and I left it out of my long comment on Egg’s other excellent post on LLM generality: LLM agents don’t have to be that good at full generality to outperform people. I don’t think humans truly do that well at it either.
Limited generality can cover arbitrarily huge portions of task space. And we're not going for the most efficient algorithm here; we're going for the first available route to exceeding human capabilities within reasonable per-cognition end-user budgets.
To outmatch humans in every way and pose an x-risk, they’ve got to advance toward full generality as humans do, by solving totally novel problems when necessary.
I think humans almost never do true out-of-distribution generalization. We learn and deploy abstract concepts that make what was formerly out-of-training-set into within-training-set. Usually, we learn that concept from another human. Once in a while, we derive our own new abstract concepts.
LLMs can’t do this yet. But it might not be hard to scaffold them to be as good at it as humans are, because we’re not as good as we’d like to imagine.
Pattern-matching reasoning using problem-solving formulas we've learned from others covers the vast majority of important tasks in the current world. So even if they're not fully general, LLMs might exceed human capabilities on most tasks. And they might be scaffolded to derive and deploy new concepts as well as or better than humans. We don't actually succeed at it often ourselves.
I think humans only do this – come up with genuinely new concepts and thereby reason in totally novel domains or achieve full generality – maybe a few times in a lifetime, and certainly not daily. We did not understand Newtonian physics easily, despite the readily available data, nor quantum physics once the relevant data was available. If you watch a young child work to match shapes to the holes in those toys, you will be either horrified or amused. It takes them a very long time (weeks, not hours) to understand what looks drop-dead simple to us, because we've already learned the relevant problem-solving algorithm.
We’re not as smart as we’d like to think. We can arrive at genuine new insights, but the process is clumsy. We make lots of wrong guesses at useful new concepts, and they’re almost always recombinations of old concepts (and thereby probably describable in language—not that foundation model agents are strictly limited to that sort).
What we do better than LLMs is test our clumsy guesses against data. But that cognitive process might be drop-dead easy to add with scaffolding.
And if that turns out not to be easy, what about asking a human for a hand with the few novel problems whose strategies aren't written down anywhere? Solving 99% of the problems might result in nearly a 100x speedup in productivity, including in AGI research.
For other reasons – some given in my response to Egg's framing, others in my tangential deleted comment (which I'll try to turn into a quick take, since it got off topic for your excellent question) – I think language model cognitive architectures are quite likely to achieve full AGI. That's a very different, and actually less scary, prospect than them falling short of competent autonomous AGI but still speeding up AGI research by 100x within a few years.
If that happens, we'll get a different sort of AGI that's probably not going to have translucent thoughts or act and think based on core goals we gave it in nice English sentences. Those properties aren't a guaranteed win, but they seem like huge advantages over trying to align an AGI that lacks them. So I'm leaning toward hoping language models make it, even though that likely means faster timelines.
The scaler view is not that LLMs scale to superintelligence directly and without bound, but that they scale just enough to start fixing their own remaining crippling flaws and assisting with further scaling, which, thanks to their digital, high-speed nature, massively accelerates timelines compared to human labor alone. So the crux is a salient threshold that's relatively low, though it might still prove too high without further advances.
To do that and achieve something that looks like take-off, they would need to get to the level of an advanced AI researcher rather than just a coding assistant – that is, to come up with novel architectures to test. Even if an LLM could write all the code for a top researcher 10x faster, that's not a 10x speedup in timelines; probably 50% at most, if much of the time goes to thinking up theoretical concepts and waiting on training runs to test results.
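That back-of-the-envelope is essentially Amdahl's law. A minimal sketch, where the assumption that coding is about a third of a researcher's time is mine, purely for illustration:

```python
def overall_speedup(fraction_sped_up, factor):
    """Amdahl's law: accelerating only part of a workflow caps the total gain."""
    return 1 / ((1 - fraction_sped_up) + fraction_sped_up / factor)

# Suppose (illustratively) coding is a third of a researcher's time
# and the LLM makes that part 10x faster:
print(overall_speedup(1 / 3, 10))    # ~1.43x overall
# Even infinitely fast coding caps out at 1.5x (a 50% speedup) if the
# other two thirds is thinking up concepts and waiting on training runs:
print(overall_speedup(1 / 3, 1e12))  # ~1.5x
```

Under that split, even an infinitely fast coding assistant caps the overall speedup at 1.5x; to move timelines by much more, the system has to accelerate the thinking and experimentation itself.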
An LLM might be able to take a few steps of advanced research (though not necessarily more than that) into many current topics (at once) if it was pre-trained on the right synthetic data. Continual improvement through search/self-play also seems to be getting closer.
Even without autonomous research, another round of scaling (that’s currently unaffordable) gets unlocked by economic value of becoming able to do routine long-horizon tasks. The question is always the least possible capability sufficient to keep the avalanche going.
I am clearly in the skeptic camp, in the sense that I don't believe the current architecture will get to AGI with our resources. That is, if all the GPUs and training data in the world were used, it wouldn't be sufficient, and maybe no amount of compute/data would be.
To me the strongest evidence that our architecture doesn't learn and generalize well isn't LLMs but, in fact, Tesla Autopilot. It has ~10,000x more training data than a person and far more FLOPS, and it is still not human-level. I think Tesla is doing pretty much everything major right with their training setup. Our current AI setups just don't learn or generalize as well as the human brain and similar systems. They don't extract symbols or diverse generalizations from high-bandwidth, uncurated data like video. Scaffolding doesn't change this.
A medium-term but IMO pretty much guaranteed way to get this would be to study and fully characterize the cortical column in the human/mammalian brain.
I am not an expert; I don't know exactly how LLMs improve depending on their inputs. But suppose we take as the energy budget "all the output of humanity, plus maybe an order of magnitude or two for near-future inventions and engineering," and as the text budget "everything humanity ever wrote, or even said, plus an order of magnitude because people will keep talking." If we don't get something game-changing out of that, then it seems we are stuck – where would we get an additional order of magnitude of inputs from?
So far, scaling has worked because we could redirect more and more resources from other parts of the economy toward LLMs. How much time is left until a significant fraction of the world economy is spent training new versions of LLMs, and what happens then? My naive assumption is that the requirements of LLMs grow exponentially; can the economy soon start, say, doubling every year to match that? If not, then the training needs of LLMs will outrun the economy, and progress will slow down.
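For what it's worth, here is a rough version of that arithmetic with placeholder numbers – the current training cost, its growth rate, and the affordable fraction of world GDP are all assumptions for illustration, not estimates:

```python
# Back-of-the-envelope for the question above; every number below is an
# illustrative assumption, not an estimate.
training_cost = 1e8   # assumed cost of a frontier training run today ($)
cost_growth = 3.0     # assumed yearly multiplier on training cost
world_gdp = 1e14      # ~$100 trillion, order of magnitude
gdp_growth = 1.03     # ~3% annual economic growth
ceiling = 0.01        # suppose at most 1% of world GDP can fund one run

years = 0
while training_cost < ceiling * world_gdp:
    training_cost *= cost_growth
    world_gdp *= gdp_growth
    years += 1

print(f"Training cost hits the ceiling after ~{years} years")  # ~9 with these inputs
```

The exact numbers don't matter much; the point of the exercise is that an exponentially growing training bill overtakes any slowly growing ceiling after a modest number of doublings.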