LLM Generality is a Timeline Crux
Four-Month Update
[EDIT: I believe that this paper looking at o1-preview, which gets much better results on both blocksworld and obfuscated blocksworld, should update us significantly toward LLMs being capable of general reasoning. See update post here.]
Short Summary
LLMs may be fundamentally incapable of fully general reasoning, and if so, short timelines are less plausible.
Longer summary
There is ML research suggesting that LLMs fail badly on attempts at general reasoning, such as planning problems, scheduling, and attempts to solve novel visual puzzles. This post provides a brief introduction to that research, and asks:
Whether this limitation is illusory or actually exists.
If it exists, whether it will be solved by scaling or is a problem fundamental to LLMs.
If fundamental, whether it can be overcome by scaffolding & tooling.
If this is a real and fundamental limitation that can’t be fully overcome by scaffolding, we should be skeptical of arguments like Leopold Aschenbrenner’s (in his recent ‘Situational Awareness’) that we can just ‘follow straight lines on graphs’ and expect AGI in the next few years.
Introduction
Leopold Aschenbrenner’s recent ‘Situational Awareness’ document has gotten considerable attention in the safety & alignment community. Aschenbrenner argues that we should expect current systems to reach human-level given further scaling[1], and that it’s ‘strikingly plausible’ that we’ll see ‘drop-in remote workers’ capable of doing the work of an AI researcher or engineer by 2027. Others hold similar views.
Francois Chollet and Mike Knoop’s new $500,000 prize for beating the ARC benchmark has also gotten considerable recent attention in AIS[2]. Chollet holds a diametrically opposed view: that the current LLM approach is fundamentally incapable of general reasoning, and hence incapable of solving novel problems. We only imagine that LLMs can reason, Chollet argues, because they’ve seen such a vast wealth of problems that they can pattern-match against. But LLMs, even if scaled much further, will never be able to do the work of AI researchers.
It would be quite valuable to have a thorough analysis of this question through the lens of AI safety and alignment. This post is not that[3], nor is it a review of the voluminous literature on this debate (from outside the AIS community). It attempts to briefly introduce the disagreement, some evidence on each side, and the impact on timelines.
What is general reasoning?
Part of what makes this issue contentious is that there’s not a widely shared definition of ‘general reasoning’, and in fact various discussions of this use various terms. By ‘general reasoning’, I mean to capture two things. First, the ability to think carefully and precisely, step by step. Second, the ability to apply that sort of thinking in novel situations[4].
Terminology is inconsistent between authors on this subject; some call this ‘system II thinking’; some ‘reasoning’; some ‘planning’ (mainly for the first half of the definition); Chollet just talks about ‘intelligence’ (mainly for the second half).
This issue is further complicated by the fact that humans aren’t fully general reasoners without tool support either. For example, seven-dimensional tic-tac-toe is a simple and easily defined system, but incredibly difficult for humans to play mentally without extensive training and/or tool support. Generalizations that are in-distribution for humans seem like something that any system should be able to do; generalizations that are out-of-distribution for humans don’t feel as though they ought to count.
How general are LLMs?
It’s important to clarify that this is very much a matter of degree. Nearly everyone was surprised by the degree to which the last generation of state-of-the-art LLMs like GPT-3 generalized; for example, no one I know of predicted that LLMs trained primarily on English-language sources would be able to do translation between languages. Some in the field argued as recently as 2020 that no pure LLM would ever be able to correctly complete ‘Three plus five equals’. The question is how general they are.
Certainly state-of-the-art LLMs do an enormous number of tasks that, from a user perspective, count as general reasoning. They can handle plenty of mathematical and scientific problems; they can write decent code; they can certainly hold coherent conversations; they can answer many counterfactual questions; they even predict Supreme Court decisions pretty well. What are we even talking about when we question how general they are?
The surprising thing we find when we look carefully is that they fail pretty badly when we ask them to do certain sorts of reasoning tasks, such as planning problems, that would be fairly straightforward for humans. If in fact they were capable of general reasoning, we wouldn’t expect these sorts of problems to present a challenge. Therefore it may be that all their apparent successes at reasoning tasks are in fact simple extensions of examples they’ve seen in their truly vast corpus of training data. It’s hard to internalize just how many examples they’ve actually seen; one way to think about it is that they’ve absorbed nearly all of human knowledge.
The weakman version of this argument is the Stochastic Parrot claim, that LLMs are executing relatively shallow statistical inference on an extremely complex training distribution, ie that they’re “a blurry JPEG of the web” (Ted Chiang). This view seems obviously false at this point (given that, for example, LLMs appear to build world models), but assuming that LLMs are fully general may be an overcorrection.
Note that this is different from the (also very interesting) question of what LLMs, or the transformer architecture, are capable of accomplishing in a single forward pass. Here we’re talking about what they can do under typical auto-regressive conditions like chat.
Evidence for generality
I take this to be most people’s default view, and won’t spend much time making the case. GPT-4 and Claude 3 Opus seem obviously to be capable of general reasoning. You can find places where they hallucinate, but it’s relatively hard to find cases in most people’s day-to-day use where their reasoning is just wrong. But if you want to see the case made explicitly, see for example “Sparks of AGI” (from Microsoft, on GPT-4) or recent models’ performance on benchmarks like MATH, which are intended to judge reasoning ability.
Further, there’s been a recurring pattern (eg in much of Gary Marcus’s writing) of people claiming that LLMs can never do X, only to be promptly proven wrong when the next version comes out. By default we should probably be skeptical of such claims.
One other thing worth noting is that we know from ‘The Expressive Power of Transformers with Chain of Thought’ that the transformer architecture is capable of general reasoning under autoregressive conditions. That doesn’t mean LLMs trained on next-token prediction learn general reasoning, but it means that we can’t just rule it out as impossible. [EDIT 10/2024: a new paper, ‘Autoregressive Large Language Models are Computationally Universal’, makes this even clearer, and furthermore demonstrates that it’s true of LLMs in particular].
Evidence against generality
The literature here is quite extensive, and I haven’t reviewed it all. Here are three examples that I personally find most compelling. For a broader and deeper review, see “A Survey of Reasoning with Foundation Models”.
Block world
All LLMs to date fail rather badly at classic problems of rearranging colored blocks. We do see improvement with scale here, but if these problems are obfuscated, performance of even the biggest LLMs drops to almost nothing[5].
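To make ‘obfuscated’ concrete, here is a minimal sketch of one way such an obfuscation could work: every domain word is consistently swapped for a meaningless token, so the logical structure of the problem survives but the familiar vocabulary does not. This is an illustration only, not the procedure used in the cited papers; the function name `obfuscate`, the example sentence, and the 16-character token format are all assumptions made for the sketch.

```python
from __future__ import annotations

import random
import re
import string


def obfuscate(problem: str, vocabulary: list[str], seed: int = 0) -> tuple[str, dict[str, str]]:
    """Replace each vocabulary phrase with a random token, consistently.

    Hypothetical helper for illustration; not taken from any paper's code.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + string.digits
    # One fixed nonsense token per phrase, so structure is preserved.
    mapping = {phrase: "".join(rng.choices(alphabet, k=16)) for phrase in vocabulary}
    # Try longer phrases first so "on top of" is matched before any shorter overlap.
    ordered = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(p) for p in ordered) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], problem), mapping


text = "The red block is on top of the blue block."
obfuscated_text, token_map = obfuscate(text, ["on top of", "block"])
print(obfuscated_text)
```

A solver that has really learned the planning rules should be indifferent to this renaming; one that leans on having seen many ‘block’ problems should not be.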
Scheduling
LLMs currently do badly at planning trips or scheduling meetings between people with availability constraints [a commenter points out that this paper has quite a few errors, so it should likely be treated with skepticism].
ARC-AGI
Current LLMs do quite badly on the ARC visual puzzles, which are reasonably easy for smart humans.
Will scaling solve this problem?
The evidence on this is somewhat mixed. Evidence that it will includes LLMs doing better on many of these tasks as they scale. The strongest evidence that it won’t is that LLMs still fail miserably on block world problems once you obfuscate the problems (to eliminate the possibility that larger LLMs only do better because they have a larger set of examples to draw from)[5].
One argument made by Sholto Douglas and Trenton Bricken (in a discussion with Dwarkesh Patel) is that this is a simple matter of reliability—given a 5% failure rate, an AI will most often fail to successfully execute a task that requires 15 correct steps. If that’s the case, we have every reason to believe that further scaling will solve the problem.
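The arithmetic behind that claim is easy to check. The sketch below assumes every step fails independently with the same probability, which is an idealization added here rather than something stated in the discussion; the function name is hypothetical.

```python
def chain_success_probability(per_step_failure: float, steps: int) -> float:
    """Chance that all `steps` succeed, assuming independent, identical steps."""
    return (1.0 - per_step_failure) ** steps


# A 5% per-step failure rate over 15 steps: the whole task succeeds less than half the time.
print(f"{chain_success_probability(0.05, 15):.3f}")  # 0.463

# Cutting the per-step failure rate to 1% changes the picture considerably.
print(f"{chain_success_probability(0.01, 15):.3f}")  # 0.860
```

Under that assumption, modest gains in per-step reliability compound into large gains on long tasks, which is why this view predicts that scaling helps.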
Will scaffolding or tooling solve this problem?
This is another open question. It seems natural to expect that LLMs could be used as part of scaffolded systems that include other tools optimized for handling general reasoning (eg classic planners like STRIPS), or LLMs can be given access to tools (eg code sandboxes) that they can use to overcome these problems. Ryan Greenblatt’s new work on getting very good results on ARC with GPT-4o + a Python interpreter provides some evidence for this.
On the other hand, a year ago many expected scaffolds like AutoGPT and BabyAGI to result in effective LLM-based agents, and many startups have been pushing in that direction; so far results have been underwhelming. Difficulty with planning and novelty seems like the most plausible explanation.
Even if tooling is sufficient to overcome this problem, outcomes depend heavily on the level of integration and performance. Currently for an LLM to make use of a tool, it has to use a substantial number of forward passes to describe the call to the tool, wait for the tool to execute, and then parse the response. If this remains true, then it puts substantial constraints on how heavily LLMs can rely on tools without being too slow to be useful[6]. If, on the other hand, such tools can be more deeply integrated, this may no longer apply. Of course, even if it’s slow there are some problems where it’s worth spending a large amount of time, eg novel research. But it does seem like the path ahead looks somewhat different if system II thinking remains necessarily slow & external.
Why does this matter?
The main reason that this is important from a safety perspective is that it seems likely to significantly impact timelines. If LLMs are fundamentally incapable of certain kinds of reasoning, and scale won’t solve this (at least in the next couple of orders of magnitude), and scaffolding doesn’t adequately work around it, then we’re at least one significant breakthrough away from dangerous AGI—it’s pretty hard to imagine an AI system executing a coup if it can’t successfully schedule a meeting with several of its co-conspirator instances.
If, on the other hand, there is no fundamental blocker to LLMs being able to do general reasoning, then Aschenbrenner’s argument starts to be much more plausible, that another couple of orders of magnitude can get us to the drop-in AI researcher, and once that happens, further progress seems likely to move very fast indeed.
So this is an area worth keeping a close eye on. I think that progress on the ARC prize will tell us a lot, now that there’s half a million dollars motivating people to try for it. I also think the next generation of frontier LLMs will be highly informative—it’s plausible that GPT-4 is just on the edge of being able to effectively do multi-step general reasoning, and if so we should expect GPT-5 to be substantially better at it (whereas if GPT-5 doesn’t show much improvement in this area, arguments like Chollet’s and Kambhampati’s are strengthened).
OK, but what do you think?
[EDIT: see update post for revised versions of these estimates]
I genuinely don’t know! It’s one of the most interesting and important open questions about the current state of AI. My best guesses are:
LLMs continue to do better at block world and ARC as they scale: 75%
LLMs entirely on their own reach the grand prize mark on the ARC prize (solving 85% of problems on the open leaderboard) before hybrid approaches like Ryan’s: 10%
Scaffolding & tools help a lot, so that the next gen[7] (GPT-5, Claude 4) + Python + a for loop can reach the grand prize mark[8]: 60%
Same but for the gen after that (GPT-6, Claude 5): 75%
The current architecture, including scaffolding & tools, continues to improve to the point of being able to do original AI research: 65%, with high uncertainty[9]
Further reading
Foundational Challenges in Assuring Alignment and Safety of Large Language Models, 04⁄24
Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought, 10⁄22
Finding Backward Chaining Circuits in Transformers Trained on Tree Search 05⁄24
Faith and Fate: Limits of Transformers on Compositionality, 05⁄23
Papers from Kambhampati’s lab, including
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning, 06⁄24
What Algorithms can Transformers Learn? A Study in Length Generalization, 10⁄23
ARC prize, 06⁄24 and On the Measure of Intelligence, 11⁄19
See also Chollet’s recent appearance on Dwarkesh Patel’s podcast.
Large Language Models Cannot Self-Correct Reasoning Yet, 10⁄23
- ^
Aschenbrenner also discusses ‘unhobbling’, which he describes as ‘fixing obvious ways in which models are hobbled by default, unlocking latent capabilities and giving them tools, leading to step-changes in usefulness’. He breaks that down into categories here. Scaffolding and tooling I discuss here; RLHF seems unlikely to help with fundamental reasoning issues. Increased context length serves roughly as a kind of scaffolding for purposes of this discussion. ‘Posttraining improvements’ is too vague to really evaluate. But note that his core claim (the graph here) ‘shows only the scaleup in base models; “unhobblings” are not pictured’.
- ^
Discussion of the ARC prize in the AIS and adjacent communities includes James Wilken-Smith, O O, and Jacques Thibodeaux.
- ^
Section 2.4 of the excellent “Foundational Challenges in Assuring Alignment and Safety of Large Language Models” is the closest I’ve seen to a thorough consideration of this issue from a safety perspective. Where this post attempts to provide an introduction, “Foundational Challenges” provides a starting point for a deeper dive.
- ^
This definition is neither complete nor uncontroversial, but is sufficient to capture the range of practical uncertainty addressed below. Feel free to mentally substitute ‘the sort of reasoning that would be needed to solve the problems described here.’ Or see my more recent attempt at a definition.
- ^
A major problem here is that they obfuscated in ways that made the challenge unnecessarily hard for LLMs by pushing against the grain of the English language. For example they use ‘pain object’ as a fact, and say that an object can ‘feast’ another object. Beyond that, the fully-obfuscated versions would be nearly incomprehensible to a human as well; eg ‘As initial conditions I have that, aqcjuuehivl8auwt object a, aqcjuuehivl8auwt object b...object b 4dmf1cmtyxgsp94g object c...‘. See Appendix 1 in ‘On the Planning Abilities of Large Language Models’. It would be valuable to repeat this experiment while obfuscating in ways that were compatible with what is, after all, LLMs’ deepest knowledge, namely how the English language works.
- ^
A useful intuition pump here might be the distinction between data stored in RAM and data swapped out to a disk cache. The same data can be retrieved in either case, but the former case is normal operation, whereas the latter case is referred to as “thrashing” and grinds the system nearly to a halt.
- ^
Assuming roughly similar compute increase ratio between gens as between GPT-3 and GPT-4.
- ^
This isn’t trivially operationalizable, because it’s partly a function of how much runtime compute you’re willing to throw at it. Let’s say a limit of 10k calls per problem.
- ^
This isn’t really operationalizable at all, I don’t think. But I’d have to get pretty good odds to bet on it anyway; I’m neither willing to buy nor sell at 65%. Feel free to treat as bullshit since I’m not willing to pay the tax ;)
The Natural Plan paper has an insane amount of errors in it. Reading it feels like I’m going crazy.
This meeting planning task seems unsolvable:
The solution requires traveling from SOMA to Nob Hill in 10 minutes, but the text doesn’t mention the travel time between SOMA and Nob Hill. Also the solution doesn’t mention meeting Andrew at all, even though that was part of the requirements.
Here’s an example of the trip planning task:
The trip is supposed to be 14 days, but requires visiting Bucharest for 5 days, London for 4 days, and Reykjavik for 7 days. I guess the point is that you can spend a day in multiple cities, but that doesn’t match with an intuitive understanding of what it means to “spend N days” in a city. Also, by that logic you could spend a total of 28 days in different cities by commuting every day, which contradicts the authors’ claim that each problem only has one solution.
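The numbers do line up under that double-counting reading. A minimal sketch, assuming each travel day is counted toward both the city being left and the city being entered (a guess about the benchmark’s convention, not something confirmed from the paper); the function name is made up for illustration:

```python
from __future__ import annotations


def calendar_days(stays: list[int]) -> int:
    """Trip length when each travel day counts toward both adjacent cities."""
    transitions = len(stays) - 1  # one shared day per move between cities
    return sum(stays) - transitions


# Bucharest 5 + London 4 + Reykjavik 7 = 16 city-days, with 2 shared travel days.
print(calendar_days([5, 4, 7]))  # 14
```

So the stated 14-day total is recoverable, but only by adopting a counting rule the problem text never spells out.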
Thanks for evaluating it in detail. I assumed that they at least hadn’t screwed up the problems! Editing the piece to note that the paper has problems.
Disappointingly, a significant number of existing benchmarks & evals have problems like that IIRC.
Thanks for writing this post!
IMO, these considerations do lengthen expected timelines, but not by enough, or with enough certainty, that we can ignore the possibility of very short timelines. The distribution of timelines matters a lot, not just the point estimate.
We are still not directing enough of our resources to this possibility if I’m right about that. We have limited time and brainpower in alignment, but it looks to me like relatively little is directed at scenarios in which the combination of scaling and scaffolding language models fairly quickly achieves competent agentic AGI. More on this below.
Excellent post, big upvote.
These are good questions, and you’ve explained them well.
I have a bunch to say on this topic. Some of it I’m not sure I should say in public, as even being convincing about the viability of this route will cause more people to work on it. So to be vague: I think applying cognitive psychology and cognitive neuroscience systems thinking to the problem suggests many routes to scaffold around the weaknesses of LLMs. I suspect the route from here to AGI is disjunctive; several approaches could work relatively quickly. There are doubtless challenges to implementing that scaffolding, as people have encountered in the last year, pursuing the most obvious and simple routes to scaffold LLMs into more capable agents.
A note on biases: I’m afraid people in AI safety are sometimes not taking short timelines seriously because it is emotionally difficult to hold both that view, and a pessimistic view of alignment difficulty. I myself am biased in both directions; there’s an emotional pull toward having my theories proven right, and seeing cool agents soon, and before they’re quite smart enough to take over. In other moments I am fervently hoping that LLMs are deceptively far from competent agentic AGI, and we have more years to work and enjoy. I think it would be wonderful to keep pushing beyond guesses, as you suggest. I don’t know how to publicly estimate possible timelines without being specific enough and convincing enough to aid progress in that direction, so I intend to first write a post about the question: if we hypothetically could get to AGI from scaffolded LLMs, should we push in that direction, based on their apparently large advantages for alignment? I suspect the answer is yes, but I’m not going to make that decision unilaterally, and I haven’t yet gotten enough people to engage deeply enough in private to be sure enough.
So for now, I’ll just say: it really doesn’t look like we can rule it out, so we should be working on alignment for possible short timelines to LMA full AGI.
Which brings me back to the note above: lots of the limited resources in alignment are going into “aligning” language models. Scare quotes to indicate the disagreement over whether or how much that’s going to contribute to aligning agents built out of LLMs. The disagreement about whether prosaic alignment is enough or even contributes much is a nontrivial issue. I’ve posted about it in the past and am currently attempting to think through and write about it more clearly.
My current answer is that prosaic alignment work on LLMs helps in those scenarios, but doesn’t do the whole job. There are very separate issues in how LLM-based agents will be given explicit goals, and the factors weighing on whether those goals are precise enough and reflexively stable in a system that gains full freedom and self-awareness. I would dearly love a few more collaborators in addressing that part of what still looks like the single most likely AGI scenario, and almost certainly the fastest route to AGI if it works.
Few thoughts
- actually, these considerations mostly increase uncertainty and variance about timelines; if LLMs miss some magic sauce, it is possible smaller systems with the magic sauce could be competitive, and we can get really powerful systems sooner than Leopold’s lines predict
- my take on what is one important thing which makes current LLMs different from humans is the gap described in Why Simulator AIs want to be Active Inference AIs; while that post intentionally avoids having a detailed scenario part, I think the ontology introduced is better for thinking about this than scaffolding
- not sure if this is clear to everyone, but I would expect the discussion of unhobbling to be one of the places where Leopold would need to stay vague to not breach OpenAI confidentiality agreements; for example, if OpenAI was putting a lot of effort into making LLM-like systems better at agency, I would expect he would not describe specific research and engineering bets
My short timelines have their highest probability path going through:
Current LLMs get scaled enough that they are capable of automating search for new and better algorithms.
Somebody does this search and finds something dramatically better than transformers.
A new model trained on this new architecture repeats the search, but even more competently. An even better architecture is found.
The new model trained on this architecture becomes AGI.
So it seems odd to me that so many people seem focused on transformer-based LLMs becoming AGI just through scaling. That seems theoretically possible to me, but I expect it to be so much less efficient that I expect it to take longer. Thus, I don’t expect that path to pay off before algorithm search has rendered it irrelevant.
My crux is that LLMs are inherently bad at search tasks over a new domain. Thus, I don’t expect LLMs to scale to improve search.
Anecdotal evidence: I’ve used LLMs extensively and my experience is that LLMs are great at retrieval but terrible at suggestion when it comes to ideas. You usually get something resembling an amalgamation of Google searches rather than suggestions that come from some kind of insight.
[EDIT: @ChosunOne convincingly argues below that the paper I cite in this comment is not good evidence for search, and I would no longer claim that it is, although I’m not necessarily sold on the broader claim that LLMs are inherently bad at search (which I see largely as an expression of the core disagreement I present in this post).]
The recently-published ‘Evidence of Learned Look-Ahead in a Chess-Playing Neural Network’ suggests that this may not be a fundamental limitation. It’s looking at a non-LLM transformer, and the degree to which we can treat it as evidence about LLMs is non-obvious (at least to me). But it’s enough to make me hesitant to conclude that this is a fundamental limitation rather than something that’ll improve with scale (especially since we see performance on planning problems, which in my view are essentially search problems, improving with scale).
The cited paper in Section 5 (Conclusion-Limitations) states plainly:
The paper is more just looking at how Leela evaluates a given line rather than doing any kind of search. And this makes sense. Pattern recognition is an extremely important part of playing chess (as a player myself), and it is embedded in another system doing the actual search, namely Monte Carlo Tree Search. So it isn’t surprising that it has learned to look ahead in a straight line since that’s what all of its training experience is going to entail. If transformers were any good at doing the search, I would expect to see a chess bot that doesn’t employ something like MCTS.
It’s not clear to me that there’s a very principled distinction between look-ahead and search, since there’s not a line of play that’s guaranteed to happen. Search is just the comparison of look-ahead on multiple lines. It’s notable that the paper generally talks about “look-ahead or search” throughout.
That said, I haven’t read this paper very closely, so I recognize I might be misinterpreting.
Or to clarify that a bit, it seems like the reason to evaluate any lines at all is in order to do search, even if they didn’t test that. Otherwise what would incentivize the model to do look-ahead at all?
In chess, a “line” is a sequence of moves that are hard to interrupt. There are kind of obvious moves you have to play or else you are just losing (such as recapturing a piece, moving king out of check, performing checkmate etc). Leela uses the neural network more for policy, which means giving a score to a given board position, which then the MCTS can use to determine whether or not to prune that direction or explore that section more. So it makes sense that Leela would have an embedding of powerful lines as part of its heuristic, since it isn’t doing the main work of search. It’s more pattern recognition on the board state, so it can learn to recognize the kinds of lines that are useful and whether or not they are “present” in the current board state. It gets this information from the MCTS system as it trains, and compresses the “triggers” into the earlier evaluations, which then this paper explores.
It’s very cool work and result, but I feel it’s too strong to say that the policy network is doing search as opposed to recognizing lines from its training at earlier board states.
Ah, ok, thanks for the clarification; I assumed ‘line’ just meant ‘a sequence of moves’. I’m more of a go player than a chess player myself.
It still seems slightly fuzzy in that other than check/mate situations no moves are fully mandatory and eg recaptures may occasionally turn out to be the wrong move?
But I retract my claim that this paper is evidence of search, and appreciate you helping me see that.
Indeed it can be difficult to know when it is actually better not to continue the line vs when it is, but that is precisely what MCTS would help figure out. MCTS would do actual exploration of board states and the budget for which states it explores would be informed by the policy network. It’s usually better to continue a line vs not, so I would expect MCTS to spend most of its budget continuing the line, and the policy would be updated during training with whether or not the recommendation resulted in more wins. Ultimately though, the policy network is probably storing a fuzzy pattern matcher for good board states (perhaps encoding common lines or interpolations of lines encountered by the MCTS) that it can use to more effectively guide the search by giving it an appropriate score.
To be clear, I don’t think a transformer is completely incapable of doing any search, just that it is probably not learning to do it in this case and is probably pretty inefficient at doing it when prompted to.
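For readers unfamiliar with how a policy network can inform a search budget, here is a rough sketch of the AlphaZero-style PUCT selection rule, where a move’s exploration bonus is proportional to its policy prior and shrinks as it gets visited. Whether Leela uses exactly this form is an assumption on my part; the `Child` class, the constant `c_puct = 1.5`, the `+ 1` under the square root, and the move names are all illustrative choices.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Child:
    prior: float            # policy network's probability for this move
    visits: int = 0         # times the search has explored this move
    value_sum: float = 0.0  # sum of evaluations backed up through it

    @property
    def mean_value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def select_child(children: dict[str, Child], c_puct: float = 1.5) -> str:
    """Pick the move maximizing mean value plus a prior-weighted exploration bonus."""
    total_visits = sum(child.visits for child in children.values())

    def score(child: Child) -> float:
        # "+ 1" keeps the bonus non-zero before any visits (an illustrative choice).
        bonus = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visits)
        return child.mean_value + bonus

    return max(children, key=lambda move: score(children[move]))


fresh = {"recapture": Child(prior=0.8), "quiet move": Child(prior=0.2)}
print(select_child(fresh))  # recapture
```

The high-prior move gets explored first and most often, but if its backed-up evaluations turn out poor, the bonus on the neglected alternative eventually wins out. That division of labor is the point: the network proposes, the tree search checks.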
Sorry to be that guy but maybe this idea shouldn’t be posted publicly (I never read it before)
How is this not basically the widespread idea of recursive self improvement? This idea is simple enough that it has occurred even to me, and there is no way that, e.g. Ilya Sutskever hasn’t thought about that.
I guess the vague idea is in the water. Just never saw it stated so explicitly. Not a big deal.
I agree with most of this. My claim here is mainly that if this is the case, then there’s at least one remaining necessary breakthrough, of unknown difficulty, before AGI, and so we can’t naively extrapolate timelines from LLM progress to date.
I additionally think that if this is the case, then LLMs’ difficulty with planning is evidence that they may not be great at automating search for new and better algorithms, although hardly conclusive evidence.
Yeah, I think my claim needs evidence to support it. That’s why I’m personally very excited to design evals targeted at detecting self-improvement capabilities.
We shouldn’t be stuck guessing about something so important!
It might also be a crux for alignment, since scalable alignment schemes like IDA and Debate rely on “task decomposition”, which seems closely related to “planning” and “reasoning”. I’ve been wondering about the slow pace of progress of IDA and Debate. Maybe it’s part of the same phenomenon as the underwhelming results of AutoGPT and BabyAGI?
If that’s the case (which seems very plausible) then it seems like we’ll either get progress on both LLM-based AGI and IDA/Debate, or on neither. That seems like a relatively good situation; those approaches will work for alignment if & only if we need them (to whatever extent they would have worked in the absence of this consideration).
There’s two other ways for things to go wrong though:
AI capabilities research switches attention from LLM (back) to RL. (There was a lot of debate in the early days of IDA about whether it would be competitive with RL, and part of that was about whether all the important tasks we want a highly capable AI to do could be broken down easily enough and well enough.)
The task decomposition part starts working well enough, but Eliezer’s (and others’) concern about “preserving alignment while amplifying capabilities” proves valid.
You note something similar, but I think it is pretty notable how much harder the obfuscated problems would be for humans:
Yeah, it’s quite frustrating that they made the obfuscated problems so unnecessarily & cryptically ungrammatical. And the randomized version would be absolutely horrendous for humans:
I’m fairly tempted to take time to redo those experiments with a more natural obfuscation scheme that follows typical English grammar. It seems pretty plausible to me that LLMs would then do much better (and also pretty plausible that they wouldn’t).
Largely echoing the points above, but I think a lot of Kambhampati’s cases (co-author on the paper you cite) stack the deck against LLMs in an unfair way. E.g., he offered the following problem to the NYT as a contemporary LLM failure case.
When I read that sentence, it felt needlessly hard to parse. So I formatted the question in a way that felt more natural (see below), and Claude Opus appears to have no problem with it (3.5 Sonnet seems less reliable, haven’t tried with other models).
Tbc, I’m actually somewhat sympathetic to Kambhampati’s broader claims about LLMs doing something closer to “approximate retrieval” rather than “reasoning”. But I think it’s sensible to view the Blocksworld examples (and many similar cases) as providing limited evidence on that question.
Claude 3 Opus just did fine for me using the original problem statement as well:
[edited to show the temperature-0 response rather than the previous (& also correct) temperature-0.7 response, for better reproducibility]
Doesn’t the problem have no solution without a spare block?
Worth noting that LLMs don’t see a nicely formatted numeric list, they see a linear sequence of tokens, e.g. I can replace all my newlines with something else and Copilot still gets it:
brief testing doesn’t show worse completions than when there are newlines. (and in the version with newlines this particular completion is oddly incomplete.)
Anyone know how LLMs tend to behave on text that is ambiguous―or unambiguous but “hard to parse”? I wonder if they “see” a superposition of meanings “mixed together” and produce a response that “sounds good for the mixture”.
That seems basically right to me; Janus presents that view well in “Simulators”.
Yes, but note in the simulator/Bayesian meta-RL view, it is important that the LLMs do not “produce a response”: they produce a prediction of ‘the next response’. The logits will, of course, try to express the posterior, averaging across all of the possibilities. This is what the mixture is: there’s many different meanings which are still possible, and you’re not sure which one is ‘true’ but they all have a lot of different posterior probabilities by this point, and you hedge your bets as to the exact next token as incentivized by a proper scoring rule which encourages you to report the posterior probability as the output which minimizes your loss. (A hypothetical agent may be trying to produce a response, but so too do all of the other hypothetical agents which are live hypotheses at that point.) Or it might be clearer to say, it produces predictions of all of the good-sounding responses, but never produces any single response.
Everything after that prediction, like picking a single, discrete, specific logit and ‘sampling’ it to fake ‘the next token’, is outside the LLM’s purview except insofar as it’s been trained on outputs from such a sampling process and has now learned that’s one of the meanings mixed in. (When Llama-3-405b is predicting the mixture of meanings of ‘the next token’, it knows ChatGPT or Claude could be the LLM writing it and predicts accordingly, but it doesn’t have anything really corresponding to “I, Llama-3-405b, am producing the next token by Boltzmann temperature sampling at x temperature”. It has a hazy idea what ‘temperature’ is from the existing corpus, and it can recognize when a base model (itself) has been sampled from and produced the current text, but it lacks the direct intuitive understanding implied by “produce a response”.) Hence all of the potential weirdness when you hardwire the next token repeatedly and feed it back in, and it becomes ever more ‘certain’ of what the meaning ‘really’ is, or it starts observing that the current text looks produced-by-a-specific-sampling-process rather than produced-by-a-specific-human, etc.
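To make the "mixture" point concrete, here is a minimal sketch of why a predictor scored by log loss (a proper scoring rule) does best by reporting the posterior-weighted mixture rather than committing to one reading. The two "meanings", the tokens, and every probability below are made-up illustrative assumptions, not measurements from any real model.

```python
import math

# Hypothetical next-token distributions under two readings of an ambiguous
# prompt. All numbers are illustrative assumptions.
p_given_meaning = {
    "meaning_a": {"yes": 0.8, "no": 0.1, "maybe": 0.1},
    "meaning_b": {"yes": 0.1, "no": 0.8, "maybe": 0.1},
}
# Assumed posterior over which reading is the 'true' one, given the text so far.
posterior = {"meaning_a": 0.6, "meaning_b": 0.4}


def mixture(posterior, p_given_meaning):
    """Posterior-weighted average of the per-meaning next-token distributions."""
    tokens = next(iter(p_given_meaning.values())).keys()
    return {
        t: sum(posterior[m] * p_given_meaning[m][t] for m in posterior)
        for t in tokens
    }


def expected_log_loss(prediction, truth):
    """Cross-entropy of `prediction` when tokens are actually drawn from `truth`."""
    return -sum(truth[t] * math.log(prediction[t]) for t in truth)


mix = mixture(posterior, p_given_meaning)

# From the predictor's point of view, the next token is distributed as the mixture.
loss_mix = expected_log_loss(mix, mix)
# Alternative: commit fully to the single most probable meaning.
loss_committed = expected_log_loss(p_given_meaning["meaning_a"], mix)

print(mix)
print(loss_mix, loss_committed)  # reporting the mixture gives the lower expected loss
```

Committing to the most likely reading costs extra loss whenever the other reading turns out to be the right one, which is the sense in which the training objective rewards hedging across the live hypotheses.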
Absolutely! In the comment you’re responding to I nearly included a link to ‘Role-Play with Large Language Models’; the section there on playing 20 questions with a model makes that distinction really clear and intuitive in my opinion.
Just for clarification, I think you’re just saying here that the model doesn’t place all its prediction mass on one token but instead spreads it out, correct? Another possible reading is that you’re saying that the model tries to actively avoid committing to one possible meaning (ie favors next tokens that maintain superposition), and I thought I remembered seeing evidence that they don’t do that.
Yes. For a base model. A tuned/RLHFed model however is doing something much closer to that (‘flattened logits’), and this plays a large role in the particular weirdnesses of those models, especially as compared to the originals (eg. it seems like maybe they suck at any kind of planning or search or simulation because they put all the prediction mass on the max-arg token rather than trying to spread mass out proportionately and so if that one token isn’t 100% right, the process will fail).
Hm, I don’t think base models would necessarily do that, no. I can see the tuned models having the incentives to train them to do so (eg. the characteristic waffle and non-commitment and vagueness are presumably favored by raters), but not the base models.
They are non-myopic, so they’re incentivized to plan ahead, but only insofar as that predicts the next token in the original training data distribution (because real tokens reflect planning or information from ‘the future’); unless real agents are actively avoiding commitment, there’s no incentive there to worsen your next-token prediction by trying to create an ambiguity which is not actually there.
(The ambiguity is in the map, not the territory. To be more concrete, imagine the ambiguity is over “author identity”, as the LLM is trying to infer whether ‘gwern’ or ‘eggsyntax’ wrote this LW comment. At each token, it maintains a latent about its certainty of the author identity; because it is super useful for prediction to know who is writing this comment, right? And the more tokens it sees for the prediction, the more confident it becomes the answer is ‘gwern’. But when I’m actually writing this, I have no uncertainty—I know perfectly well ‘gwern’ is writing this, and not ‘eggsyntax’. I am not in any way trying to ‘avoid committing to one possible [author]’ - the author is just me, gwern, fully committed from the start, whatever uncertainty a reader might have while reading this comment from start to finish. My next token, therefore, is not better predicted by imagining that I’m suffering from mental illness or psychedelics as I write this and thus might suddenly spontaneously claim to be eggsyntax and this text is deliberately ambiguous because at any moment I might be swerving from gwern to eggsyntax and back. The next token is better predicted by inferring who the author is to reduce ambiguity as much as possible, and expecting them to write in a normal non-ambiguous fashion given whichever author it actually is.)
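The author-identity example can be sketched as a sequential Bayesian update. This is a toy under loud assumptions: the two candidate authors come from the comment above, but the "tokens" and their per-author likelihoods are invented purely for illustration.

```python
# Toy per-token likelihoods for two candidate authors.
# These numbers are invented for illustration only.
likelihood = {
    "gwern": {"eg.": 0.30, "the": 0.60, "ie": 0.10},
    "eggsyntax": {"eg.": 0.05, "the": 0.60, "ie": 0.35},
}


def update(prior, token):
    """One Bayes step: reweight each author by how well they predict `token`."""
    unnormalized = {a: prior[a] * likelihood[a][token] for a in prior}
    total = sum(unnormalized.values())
    return {a: v / total for a, v in unnormalized.items()}


belief = {"gwern": 0.5, "eggsyntax": 0.5}  # the reader starts out unsure
history = [belief]
for token in ["the", "eg.", "eg.", "the", "eg."]:
    belief = update(belief, token)
    history.append(belief)

for step, b in enumerate(history):
    print(step, round(b["gwern"], 3))
```

The uncertainty lives entirely in `belief`, the reader's latent. The text itself is generated by one fixed author throughout, so nothing in the data rewards predicting tokens that keep the ambiguity alive.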
Given that I think LLMs don’t generalize, I was surprised how compelling Aschenbrenner’s case sounded when I read it (well, the first half of it. I’m short on time...). He seemed to have taken all the same evidence I knew about it, and arranged it into a very different framing. But I also felt like he underweighted criticism from the likes of Gary Marcus. To me, the illusion of LLMs being “smart” has been broken for a year or so.
To the extent LLMs appear to build world models, I think what you’re seeing is a bunch of disorganized neurons and connections that, when probed with a systematic method, can be mapped onto things that we know a world model ought to contain. A couple of important questions are
the way that such a world model was formed and
how easily we can figure out how to form those models better/differently[1].
I think LLMs get “world models” (which don’t in fact cover the whole world) in a way that is quite unlike the way intelligent humans form their own world models―and more like how unintelligent or confused humans do the same.
The way I see it, LLMs learn in much the same way a struggling D student learns (if I understand correctly how such a student learns), and the reason LLMs sometimes perform like an A student is because they have extra advantages that regular D students do not: unlimited attention span and ultrafast, ultra-precise processing backed by an extremely large set of training data. So why do D students perform badly, even with “lots” of studying? I think it’s either because they are not trying to build mental models, or because they don’t really follow what their teachers are saying. Either way, this leads them to fall back on a secondary “pattern-matching” learning mode which doesn’t depend on a world model.
If, when learning in this mode, you see enough patterns, you will learn an implicit world model. The implicit model is a proper world model in terms of predictive power, but
It requires much more training data to predict as well as a human system-2 can, which explains why D students perform worse than A students on the same amount of training data―and this is one of the reasons why LLMs need so much more training data than humans do in order to perform at an A level (other reasons: less compute per token, fewer total synapses, no ability to “mentally” generate training data, inability to autonomously choose what to train on). The way you should learn is to first develop an explicit worldmodel via system-2 thinking, then use system-2 to mentally generate training data which (along with external data) feeds into system-1. LLMs cannot do this.
Such a model tends to be harder to explain in words than an explicit world model, because the predictions are coming from system-1 without much involvement from system-2, and so much of the model is not consciously visible to the student, nor is it properly connected to its linguistic form, so the D student relies more on “feeling around” the system-1 model via queries (e.g. to figure out whether “citation” is a noun, you can do things like ask your system-1 whether “the citation” is a valid phrase―human language skills tend to always develop as pure system-1 initially, so a good linguistics course teaches you explicitly to perform these queries to extract information, whereas if you have a mostly-system-2 understanding of a language, you can use that to decide whether a phrase is correct with system-2, without an intuition about whether it’s correct. My system-1 for Spanish is badly underdeveloped, so I lean on my superior system-2/analytical understanding of grammar).
When an LLM cites a correct definition of something as if it were a textbook, then immediately afterward fails to apply that definition to the question you ask, I think that indicates the LLM doesn’t really have a world model with respect to that question, but I would go further and say that even if it has a good world model, it cannot express its world-model in words, it can only express the textbook definitions it has seen and then apply its implicit world-model, which may or may not match what it said verbally.
So if you just keep training it on more unique data, eventually it “gets it”, but I think it “gets it” the way a D student does, implicitly not explicitly. With enough experience, the D student can be competent, but never as good as similarly-experienced A students.
A corollary of the above is that I think the amount of compute required for AGI is wildly overestimated, if not by Aschenbrenner himself then by less nuanced versions of his style of thinking (e.g. Sam Altman). And much of the danger of AGI follows from this. On a meta level, my own opinions on AGI are mostly not formed via “training data”, since I have not read/seen that many articles and videos about AGI alignment (compared to any actual alignment researcher). No coincidence, then, that I was always an A to A- student, and the one time I got a C- in a technical course was when I couldn’t figure out WTF the professor was talking about. I still learned, that’s why I got a C-, but I learned in a way that seemed unnatural to me, but which incorporated some of the “brute force” that an LLM would use. I’m all about mental models and evidence; LLMs are about neither.
Aschenbrenner did help firm up my sense that current LLM tech leads to “quasi-AGI”: a competent humanlike digital assistant, probably one that can do some AI research autonomously. It appears that the AI industry (or maybe just OpenAI) is on an evolutionary approach of “let’s just tweak LLMs and our processes around them”. This may lead (via human ingenuity or chance discovery) to system-2s with explicit worldmodels, but without some breakthrough, it just leads to relatively safe quasi-AGIs, the sort that probably won’t generate groundbreaking new cancer-fighting ideas but might do a good job testing ideas for curing cancer that are “obvious” or human-generated or both.
Although LLMs badly suck at reasoning, my AGI timelines are still kinda short (roughly 1 to 15 years for “real” AGI, with quasi-AGI in 2 to 6 years), mainly because so much funding is going into this, and because only one researcher needs to figure out the secret, and because so much research is being shared publicly, and because there should be many ways to do AGI, and because quasi-AGI (if invented first) might help create real AGI. Even the AGI safety people[2] might be the ones to invent AGI, for how else will they do effective safety research? FWIW my prediction is that quasi-AGI consists of a transformer architecture with quite a large number of (conventional software) tricks and tweaks bolted on to it, while real AGI consists of transformer architecture plus a smaller number of tricks and tweaks, plus a second breakthrough of the same magnitude as transformer architecture itself (or a pair of ideas that work so well together that combining them counts as a breakthrough).
EDIT: if anyone thinks I’m on to something here, let me know your thoughts as to whether I should redact the post lest changing minds in this regard is itself hazardous. My thinking for now, though, is that presenting ideas to a safety-conscious audience might well be better than safetyists nodding along to a mental model that I think is, if not incorrect, then poorly framed.
I don’t follow ML research, so let me know if you know of proposed solutions already.
why are we calling it “AI safety”? I think this term generates a lot of “the real danger of AI is bias/disinformation/etc” responses, which should decrease if we make the actual topic clear
As someone who has been studying LLM outputs pretty intently since GPT-2, I think you are mostly right but that the details do matter here.
The LLMs give a very good illusion of being smart, but are actually kinda dumb underneath. Yes. But… with each generation they get a little less dumb, a little more able to reason and extrapolate. The difference between ‘bad’ and ‘bad, but not as bad as they used to be, and getting rapidly better’ is pretty important.
They are also bad at ‘integrating’ knowledge. This results in having certain facts memorized, but getting questions where the answer is indicated by those facts wrong when the questions come from an unexpected direction. I haven’t noticed steady progress on factual knowledge integration in the same way I have with reasoning. I do expect this hurdle will be overcome eventually. Things are progressing quite quickly, and I know of many advances which seem like compatible pareto improvements which have not yet been integrated into the frontier models because the advances are too new.
Also, I notice that LLMs are getting gradually better at being coding assistants and speeding up my work. So I don’t think it’s necessarily the case that we need to get all the way to full human-level reasoning before we get substantial positive feedback effects on ML algorithm development rate from improved coding assistance.
I’m having trouble discerning a difference between our opinions, as I expect a “kind-of AGI” to come out of LLM tech, given enough investment. Re: code assistants, I’m generally disappointed with GitHub Copilot. It’s not unusual that I’m like “wow, good job”, but bad completions are commonplace, especially when I ask a question in the sidebar (which should use a bigger LLM). Its (very hallucinatory) response typically demonstrates that it doesn’t understand our (relatively small) codebase very well, to the point where I only occasionally bother asking. (I keep wondering “did no one at GitHub think to generate an outline of the app that could fit in the context window?”)
Yes, I agree our views are quite close. My expectations closely match what you say here:
Basically I just want to point out that the progression of competence in recent models seems pretty impressive, even though the absolute values are low.
For instance, for writing code I think the following pattern of models (including only ones I’ve personally tested enough to have an opinion) shows a clear trend of increasing competence with later release dates:
GitHub Copilot (pre-GPT-4) < GPT-4 (the first release) < Claude 3 Opus < Claude 3.5 Sonnet
Basically, I’m holding in my mind the possibility that the next versions (GPT-5 and/or Claude Opus 4) will really impress me. I don’t feel confident of that. I am pretty confident that the version after next will impress me (e.g. GPT-6 / Claude Opus 5) and actually be useful for RSI.
From this list, Claude 3.5 Sonnet is the first one to be competent enough I find it even occasionally useful. I made myself use the others just to get familiar with their abilities, but their outputs just weren’t worth the time and effort on average.
P.S. if I’m wrong about the timeline―if it takes >15 years―my guess for how I’m wrong is (1) a major downturn in AGI/AI research investment and (2) executive misallocation of resources. I’ve been thinking that the brightest minds of the AI world are working on AGI, but maybe they’re just paid a lot because there are too few minds to go around. And when I think of my favorite MS developer tools, they have greatly improved over the years, but there are also fixable things that haven’t been fixed in 20 years, and good ideas they’ve never tried, and MS has created a surprising number of badly designed libraries (not to mention products) over the years. And I know people close to Google have a variety of their own pet peeves about Google.
Are AGI companies like this? Do they burn mountains of cash to pay otherwise average engineers who happen to have AI skills? Do they tend to ignore promising research directions because the results are uncertain, or because results won’t materialize in the next year, or because they don’t need a supercomputer or aren’t based mainly on transformers? Are they bad at creating tools that would’ve made the company more efficient? Certainly I expect some companies to be like that.
As for (1), I’m no great fan of copyright law, but today’s companies are probably built on a foundation of rampant piracy, and litigation might kill investment. Or, investors may be scared away by a persistent lack of discoveries to increase reliability / curtail hallucinations.
Thanks for your comments! I was traveling and missed them until now.
I think we’ve certainly seen some examples of interpretability papers that ‘find’ things in the models that aren’t there, especially when researchers train nonlinear probes. But the research community has been learning over time to distinguish cases like that from what’s really in the model (ablation, causal tracing, etc). We’ve also seen examples of world modeling that are clearly there in the model; Neel Nanda’s work finding a world model in Othello-GPT is a particularly clear case in my opinion (post, paper).
My intuitions about human learning here are very different from yours, I think. In my view, learning (eg) to produce valid sentences in a native language and to understand sentences from other speakers is very nearly the only thing that matters, and that’s something nearly all speakers achieve. Learning an explicit model for that language, in order to eg produce a correct parse tree, matters a tiny bit, very briefly, when you learn parse trees in school. Rather than intelligent humans learning a detailed explicit model of their language and unintelligent humans not doing so, it seems to me that very few intelligent humans have such a model. Mostly it’s just linguists, who need an explicit model. I would further claim that those who do learn an explicit model don’t end up being significantly better at producing and understanding language in their day-to-day lives; it’s not explicit modeling that makes us good at that.
I do agree that someone without an explicit model of a topic will often have a harder time explaining that topic to someone else, and I agree that LLMs typically learn implicit rather than explicit models. I just don’t think that that in and of itself makes them worse at using those models.
That said, to the extent that by ‘general reasoning’ we mean chains of step-by-step assertions with each step explicitly justified by valid rules of reasoning, that does seem like something that benefits a lot from an explicit model. So in the end I don’t necessarily disagree with your application of this idea to at least some versions of general reasoning; I do disagree when it comes to other sorts of general reasoning, and LLM capabilities in general.
Curated.
This is a fairly straightforward point, but one I haven’t seen written up before and I’ve personally been wondering a bunch about. I appreciated this post both for laying out the considerations pretty thoroughly, including a bunch of related reading, and laying out some concrete predictions at the end.
I feel like I have been going on about this for years. Like here, here or here. But I’d be the first to admit that I don’t really do effort posts.
I hadn’t seen your posts either (despite searching; I think the lack of widely shared terminology around this problem gets in the way). I’d be very interested to learn more about how your research agenda has progressed since that first post. This post was mostly intended to be broad audience / narrow message, just (as Raemon says) pointing to the crux here, breaking it down, and giving a sense of the arguments on each side.
The post about learned lookahead in Leela has kind of galvanised me into finally finishing an investigation I have worked on for too long already. (Partly because I think that finding is incorrect, but also because using Leela is a great idea; I had got stuck with LLMs requiring a full game for each puzzle position.)
I will ping you when I write it up.
I’m looking forward to it!
It so happens I hadn’t seen your other posts, although I think there is something that this post was aiming at, that yours weren’t quite pointed at, which is laying out “this is a crux for timelines, these are the subcomponents of the crux.” (But, I haven’t read your posts in detail yet and thought about what else they might be good at that this post wasn’t aiming for)
I always feel like self-play on math with a proof checker like Agda or Coq is a promising way to make LLMs superhuman on these areas. Do we have any strong evidence that it’s not?
Do you mean as a (presumably RL?) training method to make LLMs themselves superhuman in that area, or that the combined system can be superhuman? I think AlphaCode is some evidence for the latter, with the compiler in the role of proof-checker.
The former
I believe there is considerable low-hanging algorithmic fruit that can make LLMs better at reasoning tasks. I think these changes will involve modifications to the architecture + training objectives. One major example is highlighted by the work of https://arxiv.org/abs/2210.10749, which show that Transformers can only heuristically implement algorithms to most interesting problems in the computational complexity hierarchy. With recurrence (e.g. through CoT https://arxiv.org/abs/2310.07923) these problems can be avoided, which might lead to much better generic, domain-independent reasoning capabilities. A small number of people are already working on such algorithmic modifications to Transformers (e.g. https://arxiv.org/abs/2403.09629).
This is to say that we haven’t really explored small variations on the current LLM paradigm, and it’s quite likely that the “bugs” we see in their behavior could be addressed through manageable algorithmic changes + a few OOMs more of compute. For this reason, if they make a big difference, I could see capabilities changing quite rapidly once people figure out how to implement them. I think scaling + a little creativity is alive and well as a pathway to nearish-term AGI.
‘The Expressive Power of Transformers with Chain of Thought’ is extremely interesting, thank you! I’ve noticed a tendency to conflate the limitations of what transformers can do in a forward pass with what they can do under autoregressive conditions, so it’s great to see research explicitly addressing how the latter extends the former.
I agree that this is plausible. I mentally lumped this sort of thing into the ‘breakthrough needed’ category in the ‘Why does this matter?’ section. Your point is well-taken that there are relatively small improvements that could make the difference, but to me that has to be balanced against the fact that there have been an enormous number of papers claiming improvements to the transformer architecture that then haven’t been adopted.
From outside the scaling labs, it’s hard to know how much of that is the improvements not panning out vs a lack of willingness & ability to throw resources at pursuing them. On the one hand I suspect there’s an incentive to focus on the path that they know is working, namely continuing to scale up. On the other hand, scaling the current architecture is an extremely compute-intensive path, so I would think that it’s worth putting resources into trying to see whether these improvements would work well at scale. If you (or anyone else) has insight into the degree to which the scaling labs are actually trying to incorporate the various claimed improvements, I’d be quite interested to know.
How much does o1-preview update your view? It’s much better at Blocksworld for example.
https://x.com/rohanpaul_ai/status/1838349455063437352
https://arxiv.org/pdf/2409.19924v1
Thanks for sharing, I hadn’t seen those yet! I’ve had too much on my plate since o1-preview came out to really dig into it, in terms of either playing with it or looking for papers on it.
Quite substantially. Substantially enough that I’ll add mention of these results to the post. I saw the near-complete failure of LLMs on obfuscated Blocksworld problems as some of the strongest evidence against LLM generality. Even more substantially since one of the papers is from the same team of strong LLM skeptics (Subbarao Kambhampati’s) who produced the original results (I am restraining myself with some difficulty from jumping up and down and pointing at the level of goalpost-moving in the new paper).
There’s one sense in which it’s not an entirely apples-to-apples comparison, since o1-preview is throwing a lot more inference-time compute at the problem (in that way it’s more like Ryan’s hybrid approach to ARC-AGI). But since the key question here is whether LLMs are capable of general reasoning at all, that doesn’t really change my view; certainly there are many problems (like capabilities research) where companies will be perfectly happy to spend a lot on compute to get a better answer.
Here’s a first pass on how much this changes my numeric probabilities—I expect these to be at least a bit different in a week as I continue to think about the implications (original text italicized for clarity):
LLMs continue to do better at block world and ARC as they scale: 75% → 100%, this is now a thing that has happened (note that o1-preview also showed substantially improved results on ARC-AGI).
LLMs entirely on their own reach the grand prize mark on the ARC prize (solving 85% of problems on the open leaderboard) before hybrid approaches like Ryan’s: 10% → 20%, this still seems quite unlikely to me (especially since hybrid approaches have continued to improve on ARC). Most of my additional credence is on something like ‘the full o1 turns out to already be close to the grand prize mark’ and the rest on ‘OpenAI capabilities researchers manage to use the full o1 to find an improvement to current LLM technique (eg a better prompting approach) that can be easily fixed’.
Scaffolding & tools help a lot, so that the next gen[7] (GPT-5, Claude 4) + Python + a for loop can reach the grand prize mark[8]: 60% → 75% -- I’m tempted to put it higher, but it wouldn’t be that surprising if o1-mark-2 didn’t quite get there even with scaffolding/tools, especially since we don’t have clear insight into how much harder the full test set is.
Same but for the gen after that (GPT-6, Claude 5): 75% → 90%? I feel less sure about this one than the others; it sure seems awfully likely that o2 plus scaffolding will be able to do it! But I’m reluctant to go past 90% because progress could level off because of training data requirements, maybe the o1 → o2 jump doesn’t focus on optimizing for general reasoning, etc. It seems very plausible that I’ll bump this higher on reflection.
The current architecture, including scaffolding & tools, continues to improve to the point of being able to do original AI research: 65%, with high uncertainty[9] → 80%. That sure does seem like the world we’re living in. It’s not clear to me that o1 couldn’t already do original AI research with the right scaffolding. Sakana claims to have gotten there with GPT-4o / Sonnet, but their claims seem overblown to me.
Now that I’ve seen these, I’m going to have to think hard about whether my upcoming research projects in this area (including one I’m scheduled to lead a team on in the spring, uh oh) are still the right thing to pursue. I may write at least a brief follow-up post to this one arguing that we should all update on this question.
Thanks again, I really appreciate you drawing my attention to these.
I’ve now expanded this comment to a post—mostly the same content but with more detail.
https://www.lesswrong.com/posts/wN4oWB4xhiiHJF9bS/llms-look-increasingly-like-general-reasoners
I think that too much scaffolding can obfuscate a lack of general capability, since it allows the system to simulate a much more capable agent, at least under narrow circumstances and assuming nothing unexpected happens.
Consider the Egyptian Army in ’73. With exhaustive drill and scripting of unit movements, they were able to simulate the capabilities of an army with a competent officer corps, up until they ran out of script, upon which it reverted to a lower level of capability. This is because scripting avoids officers on the ground needing to make complex tactical decisions on the fly and communicate them to other units, all while maintaining a cohesive battle plan. If everyone sticks to the script, big holes won’t open up in their defenses, and the movements of each unit will be covered by that of others. When the script ran out (I’m massively simplifying), the cohesion of the army began to break down, rendering it increasingly vulnerable to IDF counterattacks. The gains in combat effectiveness were real, but limited to the confines of the script.
Similarly, scaffolding helps the AI avoid the really hard parts of a job, at least the really hard parts for it. Designing the script for each individual task and subtask in order to make a 90% reliable AI economically valuable turns a productivity-improving tool into an economically productive agent, but only within certain parameters, and each time you encounter a new task, more scaffolding will need to be built. I think some of the time the harder (in the human-intuitive sense) parts of the problem may be contained in the scaffolding as opposed to the tasks the AI completes.
Thus, given the highly variable nature of LLM intelligence, “X can do Y with enough scaffolding!” doesn’t automatically convince me that X possesses the core capabilities to do Y and just needs a little encouragement or w/e. It may be that task Y is composed of subtasks A and B, such that X is very good and reliable at A, but utterly incapable at B (coding and debugging?). By filtering for Y with a certain easy subset of B, using a pipeline to break it down into easier subtasks with various prompts, trying many times, and finally passing off unsolved cases to humans, you can extract much economic value from X doing Y, but only in a certain subset of cases, and still without X being reliably good at doing both A and B.
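A rough back-of-the-envelope version of that pipeline, as a sketch only: the function names, the coverage figure, and the per-attempt success rate are all hypothetical, and the calculation assumes attempts are independent and that a reliable checker can tell when an attempt succeeded.

```python
def success_within(p_single, attempts):
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p_single) ** attempts


def automated_fraction(coverage, p_single, attempts):
    """Share of ALL incoming tasks that the scaffolded system completes.

    coverage: fraction of tasks that pass the 'easy subset' filter (assumed).
    Everything outside that fraction is handed to humans.
    """
    return coverage * success_within(p_single, attempts)


# Hypothetical numbers: the filter admits half of all tasks, and the model
# solves an admitted task 60% of the time per attempt, with up to 5 attempts.
in_scope_reliability = success_within(0.6, 5)
overall = automated_fraction(0.5, 0.6, 5)

print(in_scope_reliability)  # looks very reliable on the filtered tasks
print(overall)               # but can never exceed the coverage of the filter
```

Under these assumptions the system looks close to perfectly reliable on the tasks it is allowed to see, while the share of the full task distribution it handles is capped by the filter, which is the gap between "X can do Y with enough scaffolding" and "X is good at Y".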
You could probably do something similar with low-capability human programmers playing the role of X, but it wouldn’t be economical since they cost much more than an LLM and are in some ways less predictable.
I think a lot of economically valuable intelligence is in the ability to build the scaffolding itself implicitly, which many people would call “agency”.
What if the tasks that your scaffolded LLM is doing are randomly selected pieces of cognitive labor from the full distribution of human cognitive tasks?
It seems to me like your objection is mostly to narrow distributions of tasks and scaffolding which is heavily specialized to that task.
I think narrowness of the task and amount of scaffolding might be correlated in practice, but these attributes don’t have to be related.
(You might think they are correlated because large amounts of scaffolding won’t be very useful for very diverse tasks. I think this is likely false: there exists general purpose software that I find useful for a very broad range of tasks. E.g. neovim. I agree that smart general agents should be able to build their own scaffolding and bootstrap, but it’s worth noting that the final system might be using a bunch of tools!)
For humans, we can consider eyes to be a type of scaffolding: they help us do various cognitive tasks by adding various affordances but are ultimately just attached.
Nonetheless, I predict that if I didn’t have eyes, I would be notably less efficient at my job.
Very interesting example, thanks.
Agreed that that wouldn’t be good evidence that those systems could do general reasoning. My intention in this piece is to mainly consider general-purpose scaffolding rather than task-specific.
It’s pretty unclear to me that the LLMs do much worse than humans at this task.
They establish the human baseline by picking one problem at random out of 600 and evaluating 50 humans on this. (Why only one problem!? It would be vastly more meaningful if you checked 5 problems with 10 humans assigned to each problem!) 78% of humans succeed.
(Human participants are from Prolific.)
On randomly selected problems, GPT-4 gets 35% right and I bet this improves with better prompting.
So, GPT-4 is maybe 2x worse than humans with huge error bars (and with what seems to be rapid improvement with scale).
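To put a rough number on the "huge error bars", here is a minimal sketch that computes a Wilson score interval for the human baseline, assuming 78% of 50 means 39 successes. It treats the 50 participants as independent trials on a single problem, so it says nothing about variation across problems, which is the larger concern raised above. The GPT-4 figure is used only as a point estimate because its sample size is not given in this thread.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Assumption: 78% of 50 participants corresponds to 39 successes.
human_low, human_high = wilson_interval(39, 50)

# Point estimate quoted above; no interval is computed because n is unknown here.
gpt4_rate = 0.35
ratio = 0.78 / gpt4_rate

print(f"human 95% interval: {human_low:.2f} to {human_high:.2f}")
print(f"ratio of point estimates: {ratio:.2f}")
```

Even this within-problem interval is wide, and it would widen further once between-problem variation is accounted for.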
Modulo the questions around obfuscation (which you raise in your other comment), I agree. Kambhampati emphasizes that they’re still much worse than humans. In my view the performance of the next gen of LLMs will tell us a lot about whether to take his arguments seriously.
We do not in fact have strong evidence for this. There does not exist any baseline for ARC puzzles among humans, smart or otherwise, just a claim that two people the designers asked to attempt them were able to solve them all. It seems entirely plausible to me that the best score on that leaderboard is pretty close to the human median.
Edit: I failed to mention that there is a baseline on the test set, which is different from the eval set that is used for the scoreboard and is, I believe, significantly easier.
Their website cites https://cims.nyu.edu/~brenden/papers/JohnsonEtAl2021CogSci.pdf as having found an average 84% success rate on the tested subset of puzzles.
It is worth noting that LLM-based approaches can perform reasonably well on the train set. For instance, my approach gets 72%.
The LLM based approach works quite differently from how a human would normally solve the problem, and if you give LLMs “only one attempt” or otherwise limit them to do a qualitatively similar amount of reasoning as with humans I think they do considerably worse than humans. (Though to make this “only one attempt” baseline fair, you have to allow for the iteration that humans would normally do.)
Yeah, I failed to mention this. Edited to clarify what I meant.
Thanks for finding a cite. I’ve definitely seen Chollet (on Twitter) give 85% as the success rate on the (easier) training set (and the paper picks problems from the training set as well).
There is important context here.
I also think this is plausible—note that randomly selected examples from the public evaluation set are often considerably harder than the train set on which there is a known MTurk baseline (which is an average of 84%).
In https://situational-awareness.ai/from-gpt-4-to-agi/#Unhobbling, “scaffolding” is explicitly named as a thing being worked on, so I take it that progress in scaffolding is already included in the estimate. Nothing about that estimate is “just scaling”.
And AFAICT neither Chollet nor Knoop made any claims in the sense that “scaffolding outside of LLMs won’t be done in the next 2 years” ⇒ what am I missing that is the source of hope for longer timelines, please?
Thanks, yes, I should have mentioned ‘unhobbling’ in that sentence, have added.
I debated including a flowchart on that (given below); in the end I didn’t, but maybe I should have. But tl;dr, from the ‘Why does this matter’ section:
I agree with “Why does this matter” and with the “if … then …” structure of the argument.
But I don’t see where you get such a high probability (>5%) of scaffolding not working… I mean, whatever ends up working can be retroactively called “scaffolding”, even if it falls in the “one more major breakthrough” category, and I expect such breakthroughs were already accounted for in the unhobbling predictions.
Do we know the base rate for how many years after the initial marketing hype of a new software technology we should expect “effective” solutions? What is the usual promise:delivery story for SF startups / corporate presentations around VR, metaverse, crypto, sharing drives, sharing apartments, cybersecurity, industrial process automation, self-driving ..? How much hope should we take from the communication so far that the problem is hard to solve? Did we expect, before AutoGPT and BabyAGI, that the first people to share their first attempt would be successful?
I’ve gone back and added my thoughts on unhobbling in a footnote: “
Aschenbrenner also discusses ‘unhobbling’, which he describes as ‘fixing obvious ways in which models are hobbled by default, unlocking latent capabilities and giving them tools, leading to step-changes in usefulness’. He breaks that down into categories here. Scaffolding and tooling I discuss here; RLHF seems unlikely to help with fundamental reasoning issues. Increased context length serves roughly as a kind of scaffolding for purposes of this discussion. ‘Posttraining improvements’ is too vague to really evaluate. But note that his core claim (the graph here) ‘shows only the scaleup in base models; “unhobblings” are not pictured’.”
Frankly I’d be hesitant to put > 95% on almost any claims on this topic. My strongest reason for suspecting that scaffolding might not work to get LLMs to AGI is updating on the fact that it doesn’t seem to have become a useful approach yet despite many people’s efforts (and despite the lack of obvious blockers). I certainly expect scaffolding to improve over where it is now, but I haven’t seen much reason to believe that it’ll enable planning and general reasoning capabilities that are enormously greater than LLMs’ base capabilities.
What I mean by scaffolding here is specifically wrapping the model in a broader system consisting of some combination of goal-direction, memory, and additional tools that the system can use (not ones that the model calls; I’d put those in the ‘tooling’ category), with a central outer loop that makes calls to the model. Breakthroughs resulting in better models wouldn’t count on my definition.
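As a rough illustration of that definition, here is a minimal sketch of a central outer loop with a goal, a memory, and a tool that the system (not the model) invokes. All names here (`run_scaffold`, `stub_model`, `check_answer`) are hypothetical, and `stub_model` is a stand-in for a real LLM call rather than any actual API.

```python
from typing import Callable, List, Optional, Tuple

Memory = List[Tuple[str, bool]]

def run_scaffold(goal: str,
                 model: Callable[[str], str],
                 check: Callable[[str], bool],
                 max_steps: int = 5) -> Tuple[Optional[str], Memory]:
    """Outer loop: prompt the model with the goal plus memory, then verify its proposal."""
    memory: Memory = []
    for _ in range(max_steps):
        prompt = f"Goal: {goal}\nEarlier attempts: {memory}\nPropose an answer."
        proposal = model(prompt)
        ok = check(proposal)            # tool used by the system, not called by the model
        memory.append((proposal, ok))   # memory persists across model calls
        if ok:
            return proposal, memory
    return None, memory

def stub_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM: wrong at first, corrected once it sees its failed attempt.
    return "4" if "'5'" in prompt else "5"

def check_answer(proposal: str) -> bool:
    # Hypothetical system-side verifier for the toy goal below.
    return proposal == "4"

answer, trace = run_scaffold("compute 2 + 2", stub_model, check_answer)
print(answer, trace)
```

The model itself is unchanged throughout; everything that differs between attempts lives in the loop and its memory, which is the sense in which better models would not count as scaffolding under this definition.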
Thanks for the clarification, I don’t share the intuition this will prove harder than other hard software engineering challenges in non-AI areas that weren’t solved in months but were solved in years and not decades, but other than “broad baseline is more significant than narrow evidence for me” I don’t have anything more concrete to share.
A note until fixed: Chollet also discusses ‘unhobbling’ → Aschenbrenner also discusses ‘unhobbling’
I think the shift of my intuition over the past year has looked something like: a) (a year ago) LLMs seem really smart and general (especially given all the stuff they unexpectedly learned like translation), but they lack goals and long-term memory, I bet if we give them that they’ll be really impressive. b) Oh, huh, if we add goals and long-term memory they don’t actually do that well. c) Oh, huh, they fail at stuff that seems pretty basic relative to how smart and general they seem. d) OK, probably we should question our initial impression of how smart and general they are. I realize that’s not really a coherent argument; just trying to give a sense of the overall shape of why I’ve personally gotten more skeptical.
I think that people don’t account for the fact that scaling means decreasing space for algorithmic experiments: you can’t train GPT-5 10000 times, making small tweaks each time. Some algorithmic improvements only show an effect at large scale or when they are implemented in training from scratch, so such improvements are hard to find.
I don’t think recent and further scaling really changes the ever-present tradeoff between large full runs and small experimental runs. That’s been a factor in training large neural networks since 2004 at least, the first time I was involved in attempts to deal with real-world datasets that benefit from scaling networks as far as the hardware allows.
Personally I believe that a novel algorithm/architecture which is substantially better than transformer-based LLMs is findable, and would show up even at small scale. I think the effect you are discussing is more of an issue for incremental improvements on the existing paradigm.
My point is that people can perceive difficulties with getting incremental improvement as strong evidence about LLMs being generally limited.
I work at a startup that would claim otherwise.
For example, the construct of “have the LLM write Python code, then have a Python interpreter execute it” is a powerful one (as Ryan Greenblatt has also shown), and will predictably get better as LLMs scale to be better at writing Python code with a lower density of bugs per N lines of code, and better at debugging it.
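A minimal sketch of that construct, under stated assumptions: `stub_generate_code` is a hypothetical stand-in for an LLM call, and the code runs in a plain subprocess with a timeout, with no sandboxing, which a real system would need. The loop feeds the interpreter's error output back to the generator so it can debug its own code.

```python
import subprocess
import sys

def run_python(code: str, timeout: float = 10.0):
    """Execute code in a fresh interpreter; return (succeeded, stdout, stderr)."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=timeout)
    return proc.returncode == 0, proc.stdout.strip(), proc.stderr.strip()

def solve_with_code(task, generate_code, max_attempts=3):
    """Ask for a program, run it, and retry with the traceback as feedback on failure."""
    feedback = ""
    for _ in range(max_attempts):
        code = generate_code(task, feedback)
        ok, out, err = run_python(code)
        if ok:
            return out
        feedback = err
    return None

def stub_generate_code(task, feedback):
    # Hypothetical stand-in for an LLM: the first attempt has a bug,
    # the second is fixed after seeing the traceback.
    if not feedback:
        return "print(sum(range(1, 101)) / 0)"
    return "print(sum(range(1, 101)))"

answer = solve_with_code("sum the integers 1..100", stub_generate_code)
print(answer)
```

The two quantities mentioned above map onto this loop directly: bug density determines how often the first attempt fails, and debugging ability determines how often a retry succeeds.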
Can you name the startup? I’d be very interested to see what level of success it’s achieved.
(or if you can’t name the startup, I’d love to hear more about what’s been achieved—eg what are the largest & longest-horizon tasks that your scaffolded systems can reliably accomplish?)
You.com — turn on our “genius mode” (free users get a few uses per day) and try asking it to do a moderately complex calculation, or better still figure out how to do one then do it.
Generally we find the combination of RAG web search and some agentic tool use makes an LLM appreciably more capable. (Similarly, OpenAI are also doing the tool use, and Google the web-scale RAG.)
We’re sticking to fairly short-horizon tasks, generally less than a dozen steps.
@Daniel Tan raises an interesting possibility here, that LLMs are capable of general reasoning about a problem if and only if they’ve undergone grokking on that problem. In other words, grokking is just what full generalization on a topic is (Daniel please correct me if that’s a misrepresentation).
If that’s the case, my initial guess is that we’ll only see LLMs doing general reasoning on problems that are relatively small, simple (in the sense that the fully general algorithm is simple), and common in the training data, just because grokking requires such an extended amount of training. But I don’t have very clear intuition about the degree to which frontier-scale LLMs are grokking such small common problems during ordinary training.
Alternately it could be that grokking is sufficient but not necessary, so LLMs can reason in a general way about grokked problems but also about other things.
Yup, that’s basically what I think! IMO, grokking = having memorised the “underlying rules” that define the DGP, and these rules are general by definition.
“Reasoning” is a loaded term that’s difficult to unpack, but I think a good working definition is “applying a set of rules to arrive at an answer”. In other words, reasoning is learning a “correct algorithm” to solve the problem. Therefore being able to reason correctly 100% of the time is equivalent to models having grokked their problem domain.
See this work, which finds that reasoning only happens through grokking. Separate work has trained models to do tree search, and found that backwards chaining circuits (a correct algorithm) emerge only through grokking. And also the seminal work on modular addition which found that correct algorithms emerge through grokking.
Note that the question of “is reasoning in natural language grokkable?” is a totally separate crux and one which I’m highly uncertain about.
I think the discussion, not specifically here but just in general, vastly underestimates the significance of this point. It isn’t like we expect humans to solve meeting planning problems in our heads. I use Calendly, or Outlook’s scheduling assistant and calendar. I plug all the destinations into our GPS and re-order them until the time looks low enough. One of the main reasons we want to use LLMs for these tasks at all is that, even with tool support, they are not trivial or instant for humans to solve.
There is also a reason why standardized tests for kids so often include essay questions on breaking down tasks step by step, like (to pick an example from my own past) “describe in as much detail as possible how you would make a peanut butter and jelly sandwich.” Even aspiring professional chefs have to learn proper mise en place to keep on top of their (much more complex) cooking tasks. I won’t bother listing more examples, but most humans are not naturally good at these tasks.
Yes, current LLMs are worse on many axes. IDK if that would be true if we built wrappers to let them use the planning tools humans rely on in practice, and if we put them through the kinds of practice humans use to learn these skills IRL. I suspect they still would be, but to a much lesser degree. But then I also can’t help thinking about the constant stream of incredible-lack-of-foresight things I see other humans do on a regular basis, and wonder if I’m just overestimating us.
FWIW, after I wrote this comment, I asked Gemini what it thought. It came up with a very similar POV about what its limitations were, what tools would help it, and how much those tools would close the gap with humans. Also, it linked this blog post in its reply. https://gemini.google.com/app/a72701429c8d830a
I often think here of @sarahconstantin’s excellent ‘Humans Who Are Not Concentrating Are Not General Intelligences’.
I think it is plausible, though not obvious, that large language models have a fundamental issue with reasoning. However, I don’t think this greatly impacts timelines. Here is my thinking:
I think time lines are fundamentally driven by scale and compute. We have a lot of smart people working on the problem, and there are a lot of obvious ways to address these limitations. Of course, given how research works, most of these ideas won’t work, but I am skeptical of the idea that such a counter-intuitive paradigm shift is needed that nobody has even conceived of it yet. A delay of a couple of years is possible, perhaps if the current tech stack proves remarkably profitable and the funding goes directly into the current paradigm. But as compute becomes bigger and cheaper, all the more easy it will be to rapidly try new ideas and architectures.
I think our best path forward to delaying timelines is to not build gigawatt scale data centers.
Two possible counterarguments:
I’ve heard multiple ML researchers argue that the last real breakthrough in ML architecture was transformers, in 2017. If that’s the case, and if another breakthrough of that size is needed, then the base rate maybe isn’t that high.
If LLMs hit significant limitations, because of the reasoning issue or because of a data wall, then companies & VCs won’t necessarily keep pouring money into ever-bigger clusters, and we won’t get the continued scaling you suggest.
That’s fair. Here are some things to consider:
1 - I think 2017 was not that long ago. My hunch is that the low level architecture of the network itself is not a bottleneck yet. I’d lean on more training procedures and algorithms. I’d throw RLHF and MoE as significant developments, and those are even more recent.
2 - I give maybe a 30% chance of a stall, in the case where little commercial disruption comes of LLMs. I think there will still be enough research going on at the major labs, and even universities working at a smaller scale give a decent chance of efficiency gains and other improvements the big labs can incorporate. Then again, if we agree that they won’t build the power plant, that is also my main way of stalling the timeline 10 years. The reason I only put 30% is that I’m expecting multimodality and Aschenbrenner’s “unhobblings” to get the industry a couple more years of chances to find profit.
Both of those seem plausible, though the second point seems fairly different from your original ‘time lines are fundamentally driven by scale and compute’.
Thanks for a thoughtful article. Intuitively, LLMs are similar to our own internal verbalization. We often turn to verbalizing to handle various problems when we can’t keep our train of thought by other means. However, it’s clear they only cover a subset of problems; many others can’t be tackled this way. Instead, we lean on intuition, a much more abstract and less understood process that generates outcomes based on even more compressed knowledge. It feels that the same is true for LLMs. Without fully understanding intuition and the kind of data transformations and compressions it involves, reaching true AGI could be impossible.
That’s an interesting view, but it’s not clear to me what the evidence for it is. Is this based on introspection into thinking?
Although this new paper reviewing recent evidence on language may shed at least a bit of light on the topic: ‘Language is primarily a tool for communication rather than thought’.
I would appreciate if the community here could point me to research that agrees or disagrees with my claim and conclusions, below.
Claim: one pass through a transformer (of a given size) can only do a finite number of reasoning steps.
Therefore: If we want an agent that can plan over an unbounded number of steps (e.g. one that does tree-search), it will need some component that can do an arbitrary number of iterative or recursive steps.
Sub-claim: The above claim does not conflict with the Universal Approximation Theorem.
[Epistemic note: I’m going past my expertise here, just giving my own current understanding, I encourage people with more expertise on this (possibly including you) to correct me]
Merrill and Sabharwal have done some very useful-seeming work on how strong transformers can be, both under auto-regressive conditions (‘The Expressive Power of Transformers with Chain of Thought’) and in a single forward pass with assumptions about precision (eg ‘The Parallelism Tradeoff: Limitations of Log-Precision Transformers’), although I haven’t gone through it in detail. Certainly it’s unambiguously true that there are limitations to what can be done in a single forward pass.
As a side note, I’m not sure how tree search comes into play; in what way does tree search require unbounded steps that doesn’t apply equally to linear search?
No finite agent, recursive or otherwise, can plan over an unbounded number of steps in finite time, so it’s not immediately clear to me how iteration/recursion is fundamentally different in practice. I think the better comparison is the power of a transformer under auto-regressive conditions (note that intermediate steps don’t need to be shown to a user as output). The first paper linked above shows that given polynomial intermediate steps, transformers can handle exactly the class of problems that can be solved in polynomial time, aka P. So under those conditions they’re pretty strong and general (which certainly doesn’t mean that LLMs trained on next-token prediction are that strong and general).
One useful way to think about it, for me, is that auto-regression is a form of recursion, albeit one that’s bounded in current architecture by the context length.
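A toy sketch of that framing, not a transformer: `step` stands in for a single forward pass that does a fixed amount of work, and `autoregress` is the outer loop that feeds each output back in until the answer is complete or the (assumed) context length is exhausted. The function names and the running-sum task are illustrative assumptions.

```python
def step(inputs, scratch):
    """One 'forward pass': a fixed amount of work per call, whatever the input length."""
    k = len(scratch)
    if k == len(inputs):
        return None                      # nothing left to compute
    prev = scratch[-1] if scratch else 0
    return prev + inputs[k]              # extend the running total by one element

def autoregress(inputs, context_length):
    """Repeat `step`, appending each output to the context, until done or out of room."""
    scratch = []
    while len(inputs) + len(scratch) < context_length:
        token = step(inputs, scratch)
        if token is None:
            return scratch[-1] if scratch else 0
        scratch.append(token)
    return None                          # context filled before the computation finished

full = autoregress([3, 1, 4, 1, 5], context_length=16)
truncated = autoregress([3, 1, 4, 1, 5], context_length=8)
print(full, truncated)
```

With enough context the loop finishes the sum; with too little it stops early, which is the sense in which the recursion is bounded by context length rather than by the per-step computation.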
Thanks for the references; I’ll need some time to review them. In the meanwhile, I’ll make some quick responses.
I intended tree search as just one example, since minimax tree search is a common example for game-based RL research.
In general, I agree. Though there are notable exceptions for cases such as (not mutually exclusive):
a closed form solution is found (for example, where a time-based simulation can calculate some quantity at any arbitrary time step using the same amount of computation)
approximate solutions using a fixed number of computation steps are viable
a greedy algorithm can select the immediate next action that is equivalent to following a longer-term planning algorithm
Yes, like I said above, I agree in general and see your point.
As I’m confident we both know, some algorithms can be written more compactly when recursion/iteration are available. I don’t know how much computation theory touches on this; i.e. what classes of problems this applies to and why. I would make an intuitive guess that it is conceptually related to my point earlier about closed-form solutions.
Totally, good point!
A smart human given a long enough lifespan, sufficient motivation, and the ability to respawn could take over the universe (let’s assume we are starting in today’s society, but all technological progress is halted, except when it comes from that single person).
A LLM can’t currently.
Maybe we can better understand what we mean with general reasoning by looking at concrete examples of what we expect humans are capable of achieving in the limit.
I would like to see attempts to come up with a definition of “generality”. Animals seem to be very general, despite not being very intelligent compared to us.
We are clearly talking about something very different from this when we say animals are general. Animals can do none of those things. So are animals, except for humans, really narrow systems, not general ones? Or are we improperly mixing generality with intelligence when we talk about AI generality?
I do very much duck that question here (as I say in a footnote, feel free to mentally substitute ‘the sort of reasoning that would be needed to solve the problems described here’). I notice that the piece you link itself links to an Arbital piece on general intelligence, which I haven’t seen before but am now interested to read.
To be fair, Aschenbrenner explicitly mentions that what he terms “unhobbling” of LLMs will also be needed: he just expects progress in that to continue. The question then is whether the various weaknesses you’ve mentioned (and any other important ones) will be beaten by either scaling, or unhobbling, or a combination of the two.
Agreed. I added some thoughts on the relevance of Aschenbrenner’s unhobbling claims in a footnote:
“
Aschenbrenner also discusses ‘unhobbling’, which he describes as ‘fixing obvious ways in which models are hobbled by default, unlocking latent capabilities and giving them tools, leading to step-changes in usefulness’. He breaks that down into categories here. Scaffolding and tooling I discuss here; RLHF seems unlikely to help with fundamental reasoning issues. Increased context length serves roughly as a kind of scaffolding for purposes of this discussion. ‘Posttraining improvements’ is too vague to really evaluate. But note that his core claim (the graph here) ‘shows only the scaleup in base models; “unhobblings” are not pictured’.”
A neural net can approximate any function. Given that LLMs are neural nets, I don’t see why they can’t also approximate any function/behaviour if given the right training data. Given how close they are getting to reasoning with basically unsupervised learning on a range of qualities of training data, I think they will continue to improve, and reach impressive reasoning abilities. I think of the “language” part of an LLM as like a communication layer on top of a general neural net. Being able to “think out loud” with a train of thought and a scratch pad to work with is a useful thing for a neural net to be able to do, similar to our own trains of thought IMO. It also is useful from a safety standpoint, as it would be quite the feat for back propagation itself to manage to betray us, before the model’s own visible thoughts do.
In the ‘Evidence for Generality’ section I point to a paper that demonstrates that the transformer architecture is capable of general computation (in terms of the types of formal languages it can express). A new paper, ‘Autoregressive Large Language Models are Computationally Universal’, both a) shows that this is true of LLMs in particular, and b) makes the point clearer by demonstrating that LLMs can simulate Lag systems, a formalization of computation which has been shown to be equivalent to the Turing machine (though less well-known).
I would definitely agree that if scale was the only thing needed, that could drastically shorten the timeline as compared to having to invent a completely new paradigm or AI, but even then that wouldn’t necessarily make it fast. Pure scale could still be centuries, or even millennia away assuming it would even work.
We have enough scaling to see how that works (massively exponential resources for linear gains), and given that extreme errors in reasoning (that are obvious to both experts and laypeople alike) are only lightly abated during massive amounts of scaling, it really does seem like reasoning isn’t dependent on just scale, or the required scale is absurdly large compared to what our society can afford (so either way it is probably slow if it happens.).
Personally, progress in the ‘intelligence’ part of artificial intelligence seems glacially slow to me. (Though it is sort of amazing that they can make models that can do these things and yet still can’t manage to extend the concepts much past what was directly trained on.) Current AI is good at interpolation (which is easy for humans too) and terrible at extrapolation (which is hard for humans too, but to completely different orders of magnitude). Current AI is possibly actually better at many kinds of interpolation than humans, but not in ways that much enhance its intelligence, because intelligence is much more related to extrapolation.
I think you dismiss the points of people like Gary Marcus in much too facile a manner. They aren’t saying, ‘this exact problem will never be solved’ but that they are only solved on a case by case basis (which seems to be largely true). You actually mention the failing on obfuscated examples which is a large part of their point of how they (Gary Marcus and company) know this. Obfuscated versions are ones that weren’t trained on, and thus rely on the poor reasoning abilities they manage.
Also, there is no reason to believe further scaling will always decrease error rate per step since this has often not been true! There are so many other things involved in the error rate than just scale, and it is likely scale’s contribution will stop really changing at some point. Asymptotes are, after all, a thing.
Also, GPT5 is only not already a thing because OpenAI couldn’t manage to keep improving performance meaningfully enough to use the name. Most likely the reason for their failure is that they have realized scale is not enough to meet their goals.
(At this point the new big thing o1 is out and it doesn’t seem impressive from the examples I’ve seen. That is a massive increase in inference time scale, which doesn’t help as much as you’d think if scale really worked.)
You could be right about (almost) all of that! I’m definitely not confident that scale is the only thing needed.
Part of the problem here is grounding these kinds of claims down to concrete predictions. What exactly counts as interpolation vs extrapolation? What exactly counts as progress on reasoning errors that’s more than ‘lightly abated during massive amounts of scaling’? That’s one reason I’m excited about the ARC-AGI contest; it provides a concrete benchmark for at least one sort of general reasoning (although it also involves a lot of subjectivity around what counts as a problem of the relevant kind).
I give a description here of an experiment specifically designed to test these questions. I’d be curious to hear your thoughts on it. What results would you anticipate? Does the task in this experiment count as interpolation or extrapolation in your view?
This is the one claim you make that seems unambiguously wrong to me; although of course there’s variation from architectural decisions etc, we’ve seen a really strong correlation between scale and loss, as shown in the various scaling laws papers. Of course this curve could change at some point but I haven’t seen any evidence that we’re close to that point.
Interpolation vs extrapolation is obviously very simple in theory; are you going in between points it has trained on or extending it outside of the training set. To just use math as an example (ODE solvers which are often relevant in AI but are not themselves AI), $x_{\text{next}} = x_{\text{curr}} + 0.5\,dt\,(dx_{\text{curr}} + dx_{\text{next}})$ is interpolation (Adams-Moulton with two points per interval), and $x_{\text{next}} = x_{\text{curr}} + dt\,(1.5\,dx_{\text{curr}} - 0.5\,dx_{\text{prev}})$ is extrapolation [Adams-Bashforth with two points per interval]. The former is much better and the latter much worse (but cheaper and simpler to set up).
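To make the comparison concrete, here is a minimal sketch applying both formulas to an assumed test problem, dx/dt = -x with x(0) = 1, whose exact solution is exp(-t). The test problem, step size, and function names are illustrative choices, not part of the original comment. Because the Adams-Moulton update is implicit, the sketch solves it in closed form for this linear problem; the Adams-Bashforth version is started from the exact value at the first step.

```python
import math

def f(x):
    return -x  # dx/dt for the assumed test problem; exact solution is exp(-t)

def adams_moulton(x0, dt, steps):
    """x_next = x + 0.5*dt*(f(x) + f(x_next)), solved exactly since f(x) = -x is linear."""
    x = x0
    for _ in range(steps):
        x = x * (1 - 0.5 * dt) / (1 + 0.5 * dt)
    return x

def adams_bashforth(x0, dt, steps):
    """x_next = x + dt*(1.5*f(x) - 0.5*f(x_prev)), using only already-known points."""
    x_prev, x_curr = x0, x0 * math.exp(-dt)  # exact value used to start the method
    for _ in range(steps - 1):
        x_next = x_curr + dt * (1.5 * f(x_curr) - 0.5 * f(x_prev))
        x_prev, x_curr = x_curr, x_next
    return x_curr

dt, steps = 0.1, 10
exact = math.exp(-dt * steps)
err_am = abs(adams_moulton(1.0, dt, steps) - exact)
err_ab = abs(adams_bashforth(1.0, dt, steps) - exact)
print(f"Adams-Moulton error:   {err_am:.2e}")
print(f"Adams-Bashforth error: {err_ab:.2e}")
```

On this problem the interpolating formula ends up with the smaller error, at the cost of having to solve for the unknown next value at every step.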
In practice, I agree that it is more than a bit fuzzy when evaluating complicated things like modern AI. My position is that it is amazing at interpolation and has difficulties with extrapolation (though obviously people are very keen on getting it to do the latter without issues / hallucinations since we find it somewhat annoyingly difficult in many cases).
The proposed experiment should be somewhat a test of this, though hardly definitive (not that we as a society are at the stage to do definitive tests). It also seems pretty relevant to what people want that kind of AI to be able to do that it currently struggles at. It seems important to keep in mind that we should probably build things like this from the end to beginning, which is mentioned, so that we know exactly what the correct answer is before we ask, rather than assuming.
Perhaps one idea would be to do three varieties of question for each type of question:
1. Non-obfuscated but not in training data (we do less of this than sometimes thought)
2. Obfuscated directly from known training data
3. Obfuscated and not in training data
To see how each variation changes ability. (We also do have to keep in mind how the difficulty goes for humans, obviously since we are the comparison.)
As to your disagreement where you say scale has always decreased error rate, this may be true when the scale increase is truly massive, but I have seen scale not help on numerous things in image generation AI (which I find more interesting personally due to the fact that I have found LLMs rarely useful while I don’t have the skills to do art, especially photorealistic art), and larger is often worse at a number of specific tasks even ones that are clearly within the training sets.
I have found image generation AI progress very slow, though others think it fast. I feel the same way about LLMs, but errors matter more in usefulness for the latter.
For instance, Flux1 is generally well liked, and is very large compared to many other models, but when it comes to pictures of humans, the skin is often very plasticky and unrealistic compared to much smaller, earlier models, and the pictures are often very similar across prompts that should be very different compared to earlier models. Despite also using a much larger scale text encoder compared to previous ones too (adding an LLM known as T5XXL to what was previously used, which I gather isn’t an impressive one), prompt understanding often seems quite limited in specific areas despite T5XXL being several times larger (this is probably related to the lack of diversity in the output pictures as it ignores what it doesn’t understand). Flux1 itself also comes in multiple varieties with different tradeoffs all at the same scale that lead to very different results despite the fact that they were trained on largely the same data so far as we know. Small choices in setup seem more important than pure scale for what the capabilities are.
To be specific, image generation uses a lot fewer parameters than LLMs but requires far more processing per parameter, so the numbers look a lot smaller than for LLMs. SD1 through SD1.5 is 0.9B parameters, SDXL is 4B, SD3 is a variety of sizes but the smallest in use is 2B (the only one freely available to the public), Flux1 is 12B parameters. The Flux text encoder T5XXL alone (5B parameters; it also uses clip-l) is larger than SDXL plus its text encoders (clip-l and clip-g), and SDXL still often outperforms it in understanding. The 2B SD3 (referred to as SD3 medium) is a mess that is far worse than SD1.5 (which uses clip-l and clip-g which are also tiny) at a large number of things (SD3 uses T5XXL and clip-l like Flux plus clip-g), including lacking the understanding of certain classes of prompts, which makes it borderline unusable despite dramatically higher image quality, when the stars align, than the larger SDXL. Scale is often useless for fixing specific problems of understanding. SD3 and Flux (different companies but many of the same personnel and a similar approach) are internally closer to being LLMs themselves than previous image generation, and the switch has caused a lot of problems scale certainly isn’t fixing. (SD3 actually has higher image quality when things work out well than the two variants of Flux I have used.) I’ve largely gone back to SDXL because I’m sick of Flux1’s flaws in realistic pictures (and SD3 remains almost unusable).
Sorry, I should have been clearer. I agree it’s straightforward in cases like the ones you give, I’m really thinking of the case of large language models. It’s not at all clear to me that we even have a good way to identify in- vs out-of-distribution for a model trained against much of the internet. If we did, some of this stuff would be much easier to test.
What would constitute a (minimal-ish) definitive test in your view?
And how do you expect the proposed experiment to go? Would you expect current-generation LLMs to fail completely, or to succeed for simple but not complex cases, or to have an easy time with it?
Absolutely; this is a huge weakness of much of the existing research trying to test the limitations of LLMs with respect to general reasoning ability, and a large motivation for the experiment (which has just been accepted for the next session of AI Safety Camp; if things go as expected I’ll be leading a research team on this experiment).
I’m not sure what it would mean for something not in the training data to be obfuscated. Obfuscated relative to what? In any case, my aim is very much to test something that’s definitively not in the training data, because it’s been randomly generated and uses novel words.
Sure, I only mean that there’s a strong correlation, not that there’s a perfect correspondence.
I think it’s important to distinguish error rate on the loss function, which pretty reliably decreases with scale, from other measures like ‘Does it make better art?’, which a) quite plausibly don’t improve with scale since they’re not what the model’s being trained on, and b) are very much harder to judge. Even ‘Is the skin plasticky or unrealistic?’ seems tricky (though not impossible) to judge without a human labeler.
Of course, one of the main causes of confusion is that ‘Is it good at general reasoning?’ is also a hard-to-judge question, and although it certainly seems to have improved significantly with scale, it’s hard to show that in a principled way. The experiment I describe is designed to at least get at a subset of that in a somewhat more principled way: can the models develop hypotheses in novel domains, figure out experiments that will test those hypotheses, and come to conclusions that match the underlying ground truth?
What would be a minimal-ish definitive test for LLM style AI? I don’t really know. I could come up with tests for it most likely, but I don’t really know how to make them fairly minimal. I can tell you that current AI isn’t intelligent, but as for what would prove intelligence, I’ve been thinking about it for a while and I really don’t have much. I wish I could be more helpful.
I do think your test of whether an AI can follow the scientific method in a novel area is intriguing.
Historically, a lot of people have come up with (in retrospect) really dumb tests (like Chess playing) that they assumed would be this because they didn’t really understand how AI would work, and this doesn’t seem to have abated with the switch to deep learning. I don’t want to do that, and thus I am reluctant to try (another problem with comparing human intelligence to machine intelligence). This is complicated in part because we really don’t even understand the nature of human intelligence, much less general intelligence in the abstract.
In theory, it is simple, but there is no single test that is necessarily robust to things like being in the training data because someone decided to include that particular test (which has happened many times when someone pointed out a particular flaw, though a particular test needn’t be included for that reason), so it would need to be tested across a number of different areas, and they all need to be genuinely hard if it doesn’t have the capability. Obviously holding the exact test items in reserve is useful, but I don’t think that can rule out their being included, since there are an awful lot of people making training data due to the way these models are trained. Obfuscation does help, but I wouldn’t rule out it figuring out how to deobfuscate things without being generally intelligent (humans are not great generators of problems).
More limited specific tests are easier to design. We can programmatically create effectively infinite math problems to test and as long as the generator produces a notably different distribution of problems we know it has learned math when it does well… but that only tests whether it can do math and they can create effectively infinite math for the training as well.
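As a rough illustration of the programmatic generation described above, here is a minimal Python sketch. The function names, operand ranges, and the choice of operators are all assumptions made for the example, not anything proposed in the thread. It draws one set of problems from a narrow distribution and a second set from a notably different one (much larger operands, plus multiplication), and scores any solver against both.

```python
import random

def make_problem(rng, max_operand, ops):
    """Return (question, answer) for one randomly generated arithmetic problem."""
    a = rng.randint(0, max_operand)
    b = rng.randint(0, max_operand)
    op = rng.choice(ops)
    if op == "+":
        answer = a + b
    elif op == "-":
        answer = a - b
    else:  # "*"
        answer = a * b
    return f"{a} {op} {b} = ?", answer

def make_split(seed, n, max_operand, ops):
    """Generate n problems from one distribution, reproducibly."""
    rng = random.Random(seed)
    return [make_problem(rng, max_operand, ops) for _ in range(n)]

def accuracy(solver, problems):
    """Fraction of problems for which solver(question) equals the true answer."""
    return sum(solver(q) == ans for q, ans in problems) / len(problems)

# Hypothetical settings: a narrow "training-like" distribution and a
# deliberately different held-out one.
train_like = make_split(seed=0, n=5, max_operand=99, ops="+-")
held_out = make_split(seed=1, n=5, max_operand=10**6, ops="+-*")

for question, answer in held_out:
    print(question, "->", answer)
```

Here `solver` stands in for whatever wraps the model under test. As the comment notes, a high score on `held_out` would only show that the model can do arithmetic, and the same generator could just as easily be used to produce training data.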
Perhaps if you could genuinely exclude all data during training that in any way has to do with a certain scientific discovery from training you could check how well it discerns the real rule from plausible alternative rules when asked, but the best way to do that takes a very long time (waiting for scientific discoveries that weren’t even theorized correctly at the time it was trained), and the other ways of doing it have been shown to be leaky.
The best non minimal way is to introduce it to entirely new domains where it has not been trained at all, but that requires controlling the training very tightly and may not be useful as an external metric. For instance, train it on only numbers and addition (or for bonus points, only explain addition in terms of the succession of numbers on the number line) mathematically, then explain multiplication in terms of addition and ask it to do a lot of complicated multiplication. If it does that well, explain division in terms of multiplication, and so on. See just how deep it can go and maintain correctness when you explain things only in terms of other things with just that single link. This is not an especially different idea than the one proposed, of course, but I would find it more telling. If it was good at this, then I think it would be worth looking into the level of intelligence it has more closely, but doing well here isn’t proof. (In other words, I think your test is a good start, just not proof.)
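To make the "single link" chain concrete, here is a small Python sketch in which each operation is defined only in terms of the one before it, for non-negative integers. It is an illustration of the structure of the proposed curriculum under my own assumptions (function names, the use of floor division, the extra exponentiation step), not a test harness; the interesting experiment is whether a model given only these explanations in prose can keep producing correct answers as the chain gets deeper.

```python
def succ(n):
    """The only primitive: the next number on the number line."""
    return n + 1

def add(a, b):
    """a + b, explained as applying the successor to a, b times."""
    result = a
    for _ in range(b):
        result = succ(result)
    return result

def mul(a, b):
    """a * b, explained as adding a to zero, b times."""
    result = 0
    for _ in range(b):
        result = add(result, a)
    return result

def power(a, b):
    """a ** b, explained as multiplying one by a, b times."""
    result = 1
    for _ in range(b):
        result = mul(result, a)
    return result

def floor_div(a, b):
    """a // b, explained through multiplication: the largest q with q * b <= a."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    q = 0
    while mul(succ(q), b) <= a:
        q = succ(q)
    return q

# Ground truth for questions at each depth of the chain.
print(add(3, 4), mul(6, 7), power(2, 10), floor_div(17, 5))
```

A reference implementation like this would supply the ground truth against which the model's answers at each depth could be checked.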
The problem is, all models are trained on math in general because of how the internet works so it needs to be these less well-defined areas where we can’t be certain whether or not the answers are in some way correct or flawed, and crucially, just how hard the problems really are. Is it failing/acing extremely difficult/trivial problems? Our intuitions on what is easy/hard seem built specifically for humans. (We aren’t entirely general intelligences as we appear to have many special purpose capabilities bolted on, like judging other humans.) Also, giving it access to math tools would be cheating, but people have already started integrating tools for things like that into LLMs.
LLMs are supposedly superhuman at next word prediction, so an interesting (though not telling) test for an LLM might be varying how much information, and how much content requiring intelligence, there is in a completely novel text by an author they have never seen before, and seeing how well the LLM continues to predict the next word. If it remains at a similar level, there’s probably something worth looking closely at going on in terms of reasoning. (This can of course be gamed by making it worse at next word prediction on low content stuff.) This is similar to verification set testing though, so there is some selection for this in what gets released.
For bonus points, a linguist could make up a bunch of very different full-fledged languages it hasn’t been exposed to using arbitrary (and unusual) rules of grammar and see how well it does on those tests in the new languages compared to an average human with just the same key to the languages (but this can’t just be a cipher, as those are reversible without intelligence once it has figured out how to deobfuscate things and I believe that plausibly doesn’t require intelligence exactly, though it would for a human.)
I forget what the term for this is (maybe ‘data-efficient’?), but the best single test of an area is to compare the total amount of training information given to the AI in training and prompt to the amount a human gets in that area to get to a certain level of ability across a variety of representative areas. LLMs currently do terribly at this, and we don’t have anyone even vaguely suggesting that even considering trying this at levels with as little training data as humans use would make any sense at all (and again, humans have some specific purpose capabilities built in, so this isn’t even a great test). We also don’t even know how much training data humans actually get… (I’ve seen people trying to ballpark it, but it didn’t seem credible at the time.)
I suspect that in your proposed test, modern AI would likely be able to solve the very easy questions, but would do quite badly on difficult ones. The problem is, I don’t know how easy a question should be before we expect it to be solved. I am again reluctant to opine too strongly on this matter.
So, as you know, obfuscation is a method of hiding exactly what you are getting at. You can obviously do this for things it already knows, but you can also take whatever methods you use for generating obfuscations of known data and apply them to the novel data you generated. I would strongly advise testing on known data as a comparison.
This is to test how much of the difficulty is based on the form of the question rather than the content. Or in other words, using the same exact words and setup, have completely unknown things, and completely known things asked about. (You can check how well it knows an area using the nonobfuscated stuff.) For bonus points, see how well it does on things where it already struggles just a little in plain English too.
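The comparison described above (same obfuscation method, applied to both known and novel material) can be sketched roughly as follows in Python. Everything specific here is an assumption for illustration: the obfuscation is a simple consistent renaming of key terms to nonsense words, the "known" text is a statement of Boyle's law, and the "novel" text is an invented rule with made-up quantities.

```python
import random
import re
import string

def make_nonsense_word(rng, length=7):
    """A random lowercase string standing in for a novel word."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def build_mapping(terms, seed):
    """Assign each term its own nonsense word, reproducibly."""
    rng = random.Random(seed)
    mapping, used = {}, set()
    for term in terms:
        word = make_nonsense_word(rng)
        while word in used:
            word = make_nonsense_word(rng)
        used.add(word)
        mapping[term] = word
    return mapping

def obfuscate(text, mapping):
    """Replace every whole-word occurrence of a mapped term."""
    # Longest terms first so a longer term is never shadowed by a shorter one.
    terms = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)

# Hypothetical example texts: one known law, one invented rule.
KNOWN_TEXT = "at fixed temperature, pressure times volume is constant"
NOVEL_TEXT = "at fixed glow, spin times drift is constant"
TERMS = ["temperature", "pressure", "volume", "glow", "spin", "drift"]

mapping = build_mapping(TERMS, seed=0)
conditions = {
    "known_plain": KNOWN_TEXT,
    "known_obfuscated": obfuscate(KNOWN_TEXT, mapping),
    "novel_plain": NOVEL_TEXT,
    "novel_obfuscated": obfuscate(NOVEL_TEXT, mapping),
}
for name, text in conditions.items():
    print(f"{name}: {text}")
```

Comparing performance across the four conditions would separate the cost of the obfuscated form from the cost of genuinely unfamiliar content. A plain renaming like this is reversible, so, per the earlier caveat about ciphers, doing well on it would be weak evidence on its own.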
On another note, I do believe that image generation models are specifically being trained these days to be better at both aesthetics and realism, and are simply failing to move the needle sufficiently as they grow larger. I do agree that even the ‘skin test’ isn’t really super objective (since it is testing vs the parts that humans probably have built in which likely have some skew, and a human doesn’t want to judge thousands of pictures a day on such a matter, while using an AI to judge AI really is quite error prone.).
Thanks for the lengthy and thoughtful reply!
I’m planning to make a LW post soon asking for more input on this experiment—one of my goals here is to make this experiment one that both sides of the debate agree in advance would provide good evidence. I’d love to get your input there as well if you’re so moved!
I tend not to think of intelligence as a boolean property, but of an entity having some level of intelligence (like IQ, although we certainly can’t blithely give IQ tests to LLMs and treat the results as meaningful, not that that stops people from doing it). I don’t imagine you think of it as boolean either, but calling that out in case I’m mistaken.
Agreed; at this point I assume that anything published before (or not long after) the knowledge cutoff may well be in the training data.
The obfuscation method matters as well; eg I think the Kambhampati team’s approach to obfuscation made the problems much harder in ways that are irrelevant or counterproductive to testing LLM reasoning abilities (see Ryan’s comment here and my reply for details).
I’d absolutely love that and agree it would help enormously to resolve these sorts of questions. But my guess is we won’t see deliberate exclusions on frontier LLMs anytime in the next couple of years; it’s difficult and labor-intensive to do at internet scale, and the leading companies haven’t shown any interest in doing so AFAIK (or even in releasing comprehensive data about what the training data was).
Very interesting idea! I think I informally (and anecdotally) tested something similar at one point by introducing new mathematical operations (but can’t recall how it turned out). Two questions:
Since we can’t in practice train a frontier LLM without multiplication, would artificial new operations be equally convincing in your view (eg, I don’t know, x # y means sqrt(x - 2y)? Ideally something a bit less arbitrary than that, though mathematicians tend to already write about the non-arbitrary ones).
Would providing few-shot examples (eg several demonstrations of x # y for particular values of x and y) make it less compelling?
It’s fun to confirm that for yourself :)
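For concreteness, the two conditions in the questions above (definition only versus definition plus few-shot demonstrations of x # y) could be generated along these lines. This is a hedged Python sketch: the prompt wording, the helper names, and the restriction to inputs where x - 2y is a perfect square are my own assumptions, chosen so that every answer is a whole number and easy to check.

```python
import math
import random

def hash_op(x, y):
    """The made-up operation from the comment: x # y means sqrt(x - 2y)."""
    return math.sqrt(x - 2 * y)

def sample_pair(rng):
    """Pick (x, y) so that x - 2y is a perfect square (an assumption for clean answers)."""
    y = rng.randint(0, 20)
    root = rng.randint(0, 12)
    return root * root + 2 * y, y

def build_prompt(n_shots, seed=0, define=True):
    """Return (prompt_text, expected_answer) with n_shots worked demonstrations."""
    rng = random.Random(seed)
    lines = []
    if define:
        lines.append("Define x # y to mean sqrt(x - 2y).")
    for _ in range(n_shots):
        x, y = sample_pair(rng)
        lines.append(f"{x} # {y} = {hash_op(x, y):g}")
    x, y = sample_pair(rng)
    lines.append(f"{x} # {y} =")  # the question the model must complete
    return "\n".join(lines), hash_op(x, y)

zero_shot_prompt, zero_shot_answer = build_prompt(n_shots=0)
few_shot_prompt, few_shot_answer = build_prompt(n_shots=3)
print(few_shot_prompt)
```

Setting `define=False` with several shots gives a third condition, where the rule has to be induced purely from examples, which is arguably a different capability from applying a stated definition.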
Sorry, I’m failing to understand the test you’re proposing; can you spell it out a bit more?
I found DeepMind’s experiment in teaching Gemini the Kalamang language (which it had never or barely encountered in the training data) really intriguing here, although not definitive evidence of anything (see section 5.2.2.1 of their Gemini paper for details).
From my point of view, sample efficiency is interesting but not that relevant; a model may have needed the equivalent of a thousand years of childhood to reach a certain level of intelligence, but the main thing I’m trying to investigate is what that level of intelligence is, regardless of how it got there.
My intuition is similar, that it should be able to solve them up to a certain level of difficulty (and I also expect that the difficulty level they can manage correlates pretty well with model size). But as I see it, that’s exactly the core point under debate—are LLM limitations along these lines a matter of scale or a fundamental flaw in the entire LLM approach?
Interesting point, thanks. I don’t think of the experiment as ultimately involving obfuscated data as much as novel data (certainly my aim is for it to be novel data, except insofar as it follows mathematical laws in a way that’s in-distribution for our universe), but I agree that it would be interesting and useful to see how the models do on a similar but known problem (maybe something like the gas laws). I’ll add that to the plan.
Thanks again for your deep engagement on this question! It’s both helpful and interesting to get to go into detail on this issue with someone who holds your view (whereas it’s easy to find people to fully represent the other view, and since I lean somewhat toward that view myself I think I have a pretty easy time representing the arguments for it).
Note that I am, in general, reluctant to claim to know how I will react to evidence in the future. There are things so far out there that I do know how I would react, but I like to allow myself to use all the evidence I have at that point, and not what I thought beforehand. I do not currently know enough about what would convince me of intelligence in an AI to say for sure. (In part because many people before me have been so obviously wrong.)
I wouldn’t say I see intelligence as a boolean, but as many valued… but those values include a level below which there is no meaningful intelligence (aka, not intelligent). This could be simplified to trinary, not binary. Not intelligent vs sort of intelligent vs genuinely intelligent. A rock… not intelligent. An abacus… not intelligent. A regular computer… not intelligent. Every program I have ever personally written, definitely not intelligent. Almost everyone agrees on those. There is a lot more disagreement about LLMs and other modern AI, but I’d still say they aren’t. (Sort of intelligent might include particularly advanced animals but I am unsure. I’ve certainly heard plenty of claims about it.)
I do think some of them can be said to understand certain things to a shallow degree despite not being intelligent. For example, LLMs understand what I am asking them to do if I write something in Korean asking them to answer a particular question in English (or vice versa; I tested both when LLMs became a real thing because I am learning Korean, and LLMs often did it well even back then), and if I tell an image generation AI that I want a photo, most understand what set of features makes something photographic (if well trained).
Perhaps it should be noted that I think it requires either very deep understanding of something reasonably broad or notably general intelligence to count as intelligence? This is part of my definition. I generally think people should use the same definitions as each other in these discussions, but it should be the correct one and that is hard in this case since people do not understand intelligence deeply enough to have a great definition, even when we are just talking about humans. (Sometimes I barely think I qualify as intelligent, especially when reading math or AI papers, but that is a completely different definition. How we are defining it matters.)
I am highly unlikely to consider a tool AI to be intelligent, especially since I know it doesn’t understand much about things in general. I am utterly convinced that LLMs are simple tool AI at present, as are other AIs in general use. Modern tool AI might as well just be a very complicated program I wrote as far as intelligence goes according to me.
I actually see ‘neural nets’ as creating a lossy compression scheme using the data provided for their training, but then you supply a novel key during inference that wasn’t actually part of the data and see what happens. I have heard of people getting similar results just using mechanistic schemes of certain parts of normal lossless compression as well, though even more inefficiently. (Basically, you are making a dictionary based on the training data.) Gradient descent seems to allow very limited movement near real data to still make sense and that is what most of the other advancements involved seem to be for as well.
Generally when testing things like AI for intelligence, we seem to either serve up the easiest or hardest questions, because we either want them to fail or succeed based on our own beliefs. And I agree that the way something is obfuscated matters a lot to the difficulty of the question post obfuscation. The questioner is often at fault for how results turn out whether or not the thing being questioned is intelligent enough to answer in a neutral setting. (This is true when humans question humans as well.)
I don’t find arbitrary operations as compelling. The problem with arbitrary operations is the obvious fact that they don’t make sense. Under some definitions of intelligence that matters a lot. (I don’t personally know if it does.) Plus, I don’t know how to judge things perfectly (I’m overly perfectionist in attitude, even though I’ve realized it is impossible) if they are arbitrary except in trivial cases where I can just tell a computer the formula to check. That’s why I like the rediscovery stuff.
Can you make the arbitrary operations fit together perfectly in a sequence like numbers → succession → addition → multiplication in a way that we can truly know works? And then explain why it works clearly in few lines? If so, that is much better evidence. (That’s actually an interesting idea. LLMs clearly understand human language if they understand anything, so they should be able to do it based off of your explanation to humans if they are intelligent and a human would get it. Write up an article about the succession, with how it makes sense, and then ask questions that extend it in the obvious way.)
There could be a way in which its wrong answer, or the right answer, was somehow included in the question and I don’t know about it because I am not superhuman at next word prediction (obviously, and I don’t even try). Modern AI has proven itself quite capable at reading into word choice (if it understands anything well, that would be it), and we could get it to answer correctly like ‘clever Hans’ by massaging the question even subconsciously. (I’m sure this has been pointed out by many people.)
I still do think that these arbitrary operations are a good start, just not definitive. Honestly, in some ways the problem with arbitrary operations is that they are too hard, and thus more a problem for human memory and knowledge at a given difficulty than of intelligence. If an LLM was actually intelligent, it would be a different kind of intelligence, so we’d have a hard time gauging the results.
So, I think the test where you didn’t know what I was getting at is written in a somewhat unclear manner. Think of it in terms of a sequence of completions that keep getting both more novel and more requiring of intelligence for other reasons? (Possibly separately.) How does it perform on rote word completion? Compare that to how it performs on things requiring a little understanding. Then a little more. Up until you reach genuinely intellectually challenging and completely novel ideas. How does its ability to complete these sentences change as it requires more understanding of the world of thoughts and ideas rather than just sentence completion? Obviously, it will get worse, but how does it compare to humans on the level of change? Since it is superhuman on sentence completion, if at any time it does worse than a human, it seems like good evidence that it is reaching its limit.
One thing I think should be done more for AIs is give them actual reference materials like dictionaries or the grammar manual in the paper you mentioned. In fact, I think that the AI should be trained to write those itself. (I’m sure some people do that. It is not the same as what o1 is doing, because o1’s approach is far too primitive and short term.)
I do have a major problem with taking the Gemini paper at face value, because each paper in AI makes claims that turn out to be overstated (this is probably normal in all fields, but I don’t read many outside of AI, and those are mostly just specific math.) They all sound good, but turn out to be not what is claimed. (That said, LLMs really are good at translation, though for some reason google translate doesn’t actually work all that well when used for more than a short phrase, which is funny considering the claim in the paper is for a google AI.) For some reason google AI can’t do Korean well, for instance. (I haven’t tried Gemini as I got bored of trying LLMs by then.)
From reading their description, I am not entirely sure what their procedure of testing was. The writeup seems unclear. But if I’m reading it right, the setup is designed such that it makes it harder to be sure whether the machine translation is correct. Reference translations are a proxy, so in comparing the AI translation to it rather than the actual meaning there is a bunch of extra noise.
That said, the translation of Kalamang from a grammar book and dictionary is probably close enough to the kind of thing I was speculating on, assuming there really wasn’t any in the training data. Now it needs to be done a bunch of times by neutral parties. (Not me, I’m lazy, very skeptical, and not a linguist.) The included table suggests to me that it is actually dramatically inferior to human performance according to human evaluations when translating from Kalamang (though relatively close on English to Kalamang). It is interesting.
Ignore sample-efficiency (is that the term?) at your own peril. While you are thinking about the training, I wasn’t really talking about the training, I was talking about how well it does on things for which it isn’t trained. When it comes across new information, how well does it integrate and use that when it has only seen a little bit or it is nonobviously related? This is sort of related to the few shot prompting. The fewer hints it needs to get up to a high level of performance for something it can’t do from initial training, the more likely it is to be intelligent. Most things in the world are still novel to the AI despite the insane amount of things it saw in training, which is why it makes so many mistakes. We know it can do the things it has seen a billion times (possibly literally) in its training, and that is uninteresting.
I’m glad you think this has been a valuable exchange, because I don’t think I’ve written my points very well. (Both too long and unclear for other reasons at the same time.) I have a feeling that everything I’ve said could be much clearer. (Also, given how much I wrote, a fair bit is probably wrong.) It has been interesting to me responding to your posts and having to actually think through what I think. It’s easy to get lazy when thinking about things just by myself.
Interesting, if you happen to have a link I’d be interested to learn more.
I like the idea, but it seems hard to judge ‘more novel and [especially] more requiring of intelligence’ other than to sort completions in order of human error on each.
I think there’s a lot of work to be done on this still, but there’s some evidence that in-context learning is essentially equivalent to gradient descent (though also some criticism of that claim).
I continue to think so :). Thanks again!
Sorry, I don’t have a link for using actual compression algorithms, it was a while ago. I didn’t think it would come up so I didn’t note anything down. My recent spate of commenting is unusual for me (and I don’t actually keep many notes on AI related subjects).
I definitely agree that it is ‘hard to judge’ ‘more novel and more requiring of intelligence’. It is, after all, a major thing we don’t even know how to clearly solve for evaluating other humans (so we use tricks that often rely on other things, and these tricks likely do not generalize to other possible intelligences and thus couldn’t be used here). Intelligence has not been solved.
Still, there is a big difference between the level of intelligence required when discussing how great your favorite popstar is vs what in particular they are good at vs why they are good at it (and within each category there are much more or less intellectual ways to write about it, though intellectual should not be confused with intelligent). It would have been nice if I could think up good examples, but I couldn’t. You could possibly check things like how well it completes things like parts of this conversation (which is somewhere in the middle).
I wasn’t really able to properly evaluate your links. There’s just too much they assume that I don’t know.
I found your first link, ‘Transformers Learn In-Context by Gradient Descent’ a bit hard to follow (though I don’t particularly think it is a fault of the paper itself). Once they get down to the details, they lose me. It is interesting that it would come up with similar changes based on training and just ‘reading’ the context, but if both mechanisms are simple, I suppose that makes sense.
Their claim about how in context can ‘curve’ better also reminds me of the ODEs used for samplers in diffusion models (I’ve written a number of samplers for diffusion models as a hobby/ to work on my programming). Higher degree ODEs curve more too (though they have their own drawbacks and particularly high degree is generally a bad idea) by using extra samples, just like this can use extra layers. Gradient descent is effectively first degree by default, right? So it wouldn’t be a surprise if you can curve more than it. You would expect sufficiently general things to resemble each other of course. I do find it a bit strange just how similar the loss for steps of gradient descent and transformer layers is. (Random point: I find that loss is not a very good metric for how good the actual results are at least in image generation/reconstruction. Not that I know of a good replacement. People do often come up with various different ways of measuring it though.)
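The sampler analogy above can be illustrated with a toy example. What the comment calls the "degree" of the method is usually called its order. Plain gradient descent is the explicit (first-order) Euler method applied to the gradient-flow ODE dy/dt = -∇L(y), while a second-order method such as Heun's spends an extra function evaluation per step to follow the curve of the trajectory more closely. The sketch below, in Python with a toy loss L(y) = y²/2 that I chose for the example, compares the two over the same number of steps; it says nothing about transformers themselves.

```python
import math

def euler_step(f, t, y, h):
    """First-order step: follow the slope at the current point."""
    return y + h * f(t, y)

def heun_step(f, t, y, h):
    """Second-order step: average the slope at the start and at the Euler prediction."""
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1)
    return y + 0.5 * h * (k1 + k2)

def integrate(step, f, y0, t0, t1, n_steps):
    """Advance y from t0 to t1 using n_steps equal steps of the given method."""
    h = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y = step(f, t, y, h)
        t += h
    return y

def gradient_flow(t, y):
    """dy/dt = -dL/dy for the toy loss L(y) = y**2 / 2."""
    return -y

exact = math.exp(-1.0)  # true solution y(1) when y(0) = 1
euler_error = abs(integrate(euler_step, gradient_flow, 1.0, 0.0, 1.0, 10) - exact)
heun_error = abs(integrate(heun_step, gradient_flow, 1.0, 0.0, 1.0, 10) - exact)
print(f"Euler error: {euler_error:.5f}, Heun error: {heun_error:.5f}")
```

With the same ten steps, the Heun trajectory ends up closer to the exact solution than the Euler one, at the cost of twice as many evaluations of `gradient_flow`. That is the same kind of trade as using extra samples in a diffusion sampler.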
Even though I can’t critique the details, I do think it is important to note that I often find claims of similarity like this in areas I understand better to not be very illuminating because people want to find similarities/analogies to understand it more easily.
The graphs really are shockingly similar in the single layer case though, which raises the likelihood that there’s something to it. And the multi-layer ones really do seem like simply a higher degree polynomial ODE.
The second link ‘In-context Learning and Gradient Descent Revisited’, which was equally difficult, has this line “Surprisingly, we find that untrained models achieve similarity scores at least as good as trained ones. This result provides strong evidence against the strong ICL-GD correspondence.” Which sounds pretty damning to me, assuming they are correct (which I also can’t evaluate).
I could probably figure them out, but I expect it would take me a lot of time.
Agreed, that’s definitely a general failure mode.
Hi, apologies for having failed to respond; I went out of town and lost track of this thread. Reading back through what you’ve said. Thank you!
No problem with the failure to respond. I appreciate that this way of communicating is asynchronous (and I don’t necessarily reply to things promptly either). And I think it would be reasonable to drop it at any point if it didn’t seem valuable.
Also, you’re welcome.
Nicky Case points me to ‘Emergent Analogical Reasoning in Large Language Models’, from Webb et al, late 2022, which claims that GPT-3 does better than humans on a version of Raven’s Standard Progressive Matrices, often considered one of the best measures of non-verbal fluid intelligence. I somewhat roll to disbelieve, because that would seem to conflict with eg LLMs’ poor performance on ARC-AGI. There’s been some debate about the paper:
Response: Emergent analogical reasoning in large language models (08/23)
Evidence from counterfactual tasks supports emergent analogical reasoning in large language models (04/24)
(different thread of disagreement)
On Analogy-Making in Large Language Models (Melanie Mitchell, 01/23)
Response to “On Analogy-Making in Large Language Models” (03/23)
I find Mitchell’s point pretty strong here:
I found the arguments in Response: Emergent analogical reasoning in large language models somewhat weaker on the whole, and in particular I think rearranging the alphabet on the fly (section 7.1 & appendix 7.1) is fundamentally hard to deal with for LLMs and so doesn’t cleanly measure general reasoning. Their argument that some of this may be in the training data does seem reasonable to me.
Overall I’m left somewhat skeptical of the claims from the Webb et al paper, but it’s at least a bit of possible evidence on general reasoning.
When world chess champion Anand won arguably his best and most creative game, with black, against Aronian, he said in an interview afterward “yeah it’s no big deal the position was the same as in [slightly famous game from 100 years ago]”.
Of course the similarity is only visible for genius chess players.
So maybe pattern matching and novel thinking are, in fact, the same thing.
Added to the ‘Evidence for generality’ section after discovering this paper:
Jacob Steinhardt on predicting emergent capabilities:
The nature of these things is that they’re hard to predict, but general reasoning satisfies both criteria, making it a prime candidate for a capability that will emerge with scale.
I think the trouble with that argument is that it seems equally compelling for any useful capabilities, regardless of how achievable they are (eg it works for ‘deterministically simulate the behavior of the user’ even though that seems awfully unlikely to emerge in the foreseeable future). So I don’t think we can take it as evidence that we’ll see general reasoning emerge at any nearby scale.
I think you don’t mean this literally as the paper linked does not argue for this actual position. Can you clarify exactly what you mean?
No, I mean that quite literally. From Appendix B of ‘Climbing Towards NLU’:
‘Tasks like DROP (Dua et al., 2019) require interpretation of language into an external world; in the case of DROP, the world of arithmetic. To get a sense of how existing LMs might do at such a task, we let GPT-2 complete the simple arithmetic problem Three plus five equals. The five responses below, created in the same way as above, show that this problem is beyond the current capability of GPT-2, and, we would argue, any pure LM.’
I found it because I went looking for falsifiable claims in ‘On the Dangers of Stochastic Parrots’ and couldn’t really find any, so I went back further to ‘Climbing Towards NLU’ on Ryan Greenblatt and Gwern’s recommendation.
Oh wow—missed that. Thanks!