Given that I think LLMs don’t generalize, I was surprised how compelling Aschenbrenner’s case sounded when I read it (well, the first half of it. I’m short on time...). He seemed to have taken all the same evidence I knew about it, and arranged it into a very different framing. But I also felt like he underweighted criticism from the likes of Gary Marcus. To me, the illusion of LLMs being “smart” has been broken for a year or so.
To the extent LLMs appear to build world models, I think what you’re seeing is a bunch of disorganized neurons and connections that, when probed with a systematic method, can be mapped onto things that we know a world model ought to contain. A couple of important questions are
how such a world model was formed, and
how easily we can figure out how to form those models better/differently[1].
I think LLMs get “world models” (which don’t in fact cover the whole world) in a way that is quite unlike the way intelligent humans form their own world models―and more like how unintelligent or confused humans do the same.
The way I see it, LLMs learn in much the same way a struggling D student learns (if I understand correctly how such a student learns), and the reason LLMs sometimes perform like an A student is that they have extra advantages regular D students do not: unlimited attention span and ultrafast, ultra-precise processing backed by an extremely large set of training data. So why do D students perform badly, even with “lots” of studying? I think it’s either because they are not trying to build mental models, or because they don’t really follow what their teachers are saying. Either way, this leads them to fall back on a secondary “pattern-matching” learning mode which doesn’t depend on a world model.
If, when learning in this mode, you see enough patterns, you will learn an implicit world model. The implicit model is a proper world model in terms of predictive power, but
It requires much more training data to predict as well as a human system-2 can, which explains why D students perform worse than A students on the same amount of training data―and this is one of the reasons why LLMs need so much more training data than humans do in order to perform at an A level (other reasons: less compute per token, fewer total synapses, no ability to “mentally” generate training data, inability to autonomously choose what to train on). The way you should learn is to first develop an explicit world model via system-2 thinking, then use system-2 to mentally generate training data which (along with external data) feeds into system-1. LLMs cannot do this.
Such a model tends to be harder to explain in words than an explicit world model, because the predictions come from system-1 with little involvement from system-2, so much of the model is not consciously visible to the student, nor is it properly connected to its linguistic form. The D student therefore relies more on “feeling around” the system-1 model via queries. For example, to figure out whether “citation” is a noun, you can ask your system-1 whether “the citation” is a valid phrase. Human language skills tend to develop as pure system-1 initially, so a good linguistics course explicitly teaches you to perform such queries to extract information; by contrast, if you have a mostly-system-2 understanding of a language, you can use system-2 to decide whether a phrase is correct without having an intuition about it. (My system-1 for Spanish is badly underdeveloped, so I lean on my stronger system-2/analytical understanding of its grammar.)
When an LLM cites a correct definition of something as if reciting a textbook, then immediately afterward fails to apply that definition to the question you asked, I think that indicates the LLM doesn’t really have a world model with respect to that question. But I would go further: even if it has a good world model, it cannot express that model in words; it can only recite the textbook definitions it has seen and then apply its implicit world model, which may or may not match what it said verbally.
So if you just keep training it on more unique data, eventually it “gets it”, but I think it “gets it” the way a D student does, implicitly not explicitly. With enough experience, the D student can be competent, but never as good as similarly-experienced A students.
A corollary of the above is that I think the amount of compute required for AGI is wildly overestimated, if not by Aschenbrenner himself then by less nuanced versions of his style of thinking (e.g. Sam Altman). And much of the danger of AGI follows from this. On a meta level, my own opinions on AGI are mostly not formed via “training data”, since I have not read/seen that many articles and videos about AGI alignment (compared to any actual alignment researcher). No coincidence, then, that I was always an A to A- student, and that the one time I got a C- in a technical course was when I couldn’t figure out WTF the professor was talking about. I still learned (that’s why I got a C- rather than worse), but in a way that felt unnatural to me, one that incorporated some of the “brute force” an LLM would use. I’m all about mental models and evidence; LLMs are about neither.
Aschenbrenner did help firm up my sense that current LLM tech leads to “quasi-AGI”: a competent humanlike digital assistant, probably one that can do some AI research autonomously. It appears that the AI industry (or maybe just OpenAI) is on an evolutionary approach of “let’s just tweak LLMs and our processes around them”. This may lead (via human ingenuity or chance discovery) to system-2s with explicit world models, but without some breakthrough, it just leads to relatively safe quasi-AGIs, the sort that probably won’t generate groundbreaking new cancer-fighting ideas but might do a good job testing ideas for curing cancer that are “obvious” or human-generated or both.
Although LLMs badly suck at reasoning, my AGI timelines are still kinda short―roughly 1 to 15 years for “real” AGI, with quasi-AGI in 2 to 6 years―mainly because so much funding is going into this, and because only one researcher needs to figure out the secret, and because so much research is being shared publicly, and because there should be many ways to do AGI, and because quasi-AGI (if invented first) might help create real AGI. Even the AGI safety people[2] might be the ones to invent AGI, for how else will they do effective safety research? FWIW my prediction is that quasi-AGI consists of a transformer architecture with quite a large number of (conventional software) tricks and tweaks bolted on to it, while real AGI consists of transformer architecture plus a smaller number of tricks and tweaks, plus a second breakthrough of the same magnitude as transformer architecture itself (or a pair of ideas that work so well together that combining them counts as a breakthrough).
EDIT: if anyone thinks I’m on to something here, let me know your thoughts as to whether I should redact the post, lest changing minds on this point be itself hazardous. My thinking for now, though, is that presenting these ideas to a safety-conscious audience might well be better than safetyists nodding along to a mental model that I think is, if not incorrect, then poorly framed.
I don’t follow ML research, so let me know if you know of proposed solutions already.
Why are we calling it “AI safety”? I think this term generates a lot of “the real danger of AI is bias/disinformation/etc.” responses, which should decrease if we make the actual topic clear.
Given that I think LLMs don’t generalize, I was surprised how compelling Aschenbrenner’s case sounded when I read it (well, the first half of it. I’m short on time...). He seemed to have taken all the same evidence I knew about it, and arranged it into a very different framing. But I also felt like he underweighted criticism from the likes of Gary Marcus. To me, the illusion of LLMs being “smart” has been broken for a year or so.
As someone who has been studying LLM outputs pretty intently since GPT-2, I think you are mostly right but that the details do matter here.
The LLMs give a very good illusion of being smart, but are actually kinda dumb underneath. Yes. But… with each generation they get a little less dumb, a little more able to reason and extrapolate. The difference between ‘bad’ and ‘bad, but not as bad as they used to be, and getting rapidly better’ is pretty important.
They are also bad at ‘integrating’ knowledge. This shows up as having certain facts memorized, yet getting questions wrong when those facts indicate the answer but the question comes from an unexpected direction. I haven’t noticed steady progress on factual knowledge integration in the same way I have with reasoning. I do expect this hurdle will be overcome eventually. Things are progressing quite quickly, and I know of many advances which seem like compatible Pareto improvements that have not yet been integrated into the frontier models because the advances are too new.
Also, I notice that LLMs are getting gradually better at being coding assistants and speeding up my work. So I don’t think it’s necessarily the case that we need to get all the way to full human-level reasoning before we get substantial positive feedback effects on ML algorithm development rate from improved coding assistance.
I’m having trouble discerning a difference between our opinions, as I also expect a “kind-of AGI” to come out of LLM tech, given enough investment. Re: code assistants, I’m generally disappointed with GitHub Copilot. It’s not unusual that I’m like “wow, good job”, but bad completions are commonplace, especially when I ask a question in the sidebar (which should use a bigger LLM). Its (very hallucinatory) responses typically demonstrate that it doesn’t understand our (relatively small) codebase very well, to the point where I only occasionally bother asking. (I keep wondering: did no one at GitHub think to generate an outline of the app that could fit in the context window?)
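The “outline of the app” idea above could be sketched roughly as follows. This is a hypothetical illustration, not anything Copilot actually does: walk a Python project, keep only class and function signatures, and emit a compact summary small enough to prepend to an LLM prompt. The file layout and naming here are invented for the example.

```python
# Hypothetical sketch: condense a Python codebase into a signature-only
# outline that could fit in an LLM's context window.
import ast
from pathlib import Path

def outline_file(path: Path) -> list[str]:
    """Return one line per class/function (including methods) in a file."""
    tree = ast.parse(path.read_text(encoding="utf-8"))
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
    return lines

def outline_repo(root: Path) -> str:
    """Concatenate outlines of all .py files under `root`."""
    parts = []
    for path in sorted(root.rglob("*.py")):
        entries = outline_file(path)
        if entries:
            parts.append(f"# {path.relative_to(root)}")
            parts.extend("  " + e for e in entries)
    return "\n".join(parts)
```

For a small codebase, an outline like this is orders of magnitude smaller than the source itself, which is why it seems plausible it could fit in a context window even when the full code cannot.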
Yes, I agree our views are quite close. My expectations closely match what you say here:
Although LLMs badly suck at reasoning, my AGI timelines are still kinda short―roughly 1 to 15 years for “real” AGI, with quasi-AGI in 2 to 6 years―mainly because so much funding is going into this, and because only one researcher needs to figure out the secret, and because so much research is being shared publicly, and because there should be many ways to do AGI, and because quasi-AGI (if invented first) might help create real AGI.
Basically I just want to point out that the progression of competence in recent models seems pretty impressive, even though the absolute values are low.
For instance, for writing code I think the following pattern of models (including only ones I’ve personally tested enough to have an opinion) shows a clear trend of increasing competence with later release dates:
GitHub Copilot (pre-GPT-4) < GPT-4 (the first release) < Claude 3 Opus < Claude 3.5 Sonnet
Basically, I’m holding in my mind the possibility that the next versions (GPT-5 and/or Claude Opus 4) will really impress me. I don’t feel confident of that. I am pretty confident that the version after next will impress me (e.g. GPT-6 / Claude Opus 5) and actually be useful for RSI.
From this list, Claude 3.5 Sonnet is the first one competent enough that I find it even occasionally useful. I made myself use the others just to get familiar with their abilities, but their outputs just weren’t worth the time and effort on average.
P.S. if I’m wrong about the timeline―if it takes >15 years―my guess for how I’m wrong is (1) a major downturn in AGI/AI research investment and (2) executive misallocation of resources. I’ve been thinking that the brightest minds of the AI world are working on AGI, but maybe they’re just paid a lot because there are too few minds to go around. And when I think of my favorite MS developer tools, they have greatly improved over the years, but there are also fixable things that haven’t been fixed in 20 years, and good ideas they’ve never tried, and MS has created a surprising number of badly designed libraries (not to mention products) over the years. And I know people close to Google have a variety of their own pet peeves about Google.
Are AGI companies like this? Do they burn mountains of cash to pay otherwise-average engineers who happen to have AI skills? Do they tend to ignore promising research directions because the results are uncertain, or because results won’t materialize in the next year, or because they don’t need a supercomputer or aren’t based mainly on transformers? Are they bad at creating tools that would have made the company more efficient? Certainly I expect some companies to be like that.
As for (1), I’m no great fan of copyright law, but today’s companies are probably built on a foundation of rampant piracy, and litigation might kill investment. Or, investors may be scared away by a persistent lack of discoveries to increase reliability / curtail hallucinations.
Thanks for your comments! I was traveling and missed them until now.
To the extent LLMs appear to build world models, I think what you’re seeing is a bunch of disorganized neurons and connections that, when probed with a systematic method, can be mapped onto things that we know a world model ought to contain.
I think we’ve certainly seen some examples of interpretability papers that ‘find’ things in the models that aren’t there, especially when researchers train nonlinear probes. But the research community has been learning over time to distinguish cases like that from what’s really in the model (ablation, causal tracing, etc.). We’ve also seen examples of world modeling that are clearly there in the model; Neel Nanda’s work finding a world model in Othello-GPT is a particularly clear case in my opinion (post, paper).
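For readers unfamiliar with probing, here is a minimal, self-contained sketch of the linear-probe idea under discussion. The “activations” below are synthetic (random vectors with a planted feature direction), standing in for a real model’s hidden states; a real probe would be trained on activations recorded from the network, with ablation or causal tracing used to check the feature is actually used, not merely decodable.

```python
# Minimal linear-probe sketch on synthetic "activations".
# A logistic-regression probe is fit to predict a world-state bit
# (e.g. "is this board square occupied?") from hidden vectors.
# High held-out accuracy is (weak) evidence the feature is represented.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for activations: 1000 examples, 64 dims,
# with a planted linear "feature direction" for the probe to find.
n, d = 1000, 64
feature_dir = rng.normal(size=d)
acts = rng.normal(size=(n, d))
labels = (acts @ feature_dir > 0).astype(float)  # the "world state" bit

def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe trained by plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

# Train on one split, evaluate on a held-out split.
w, b = fit_linear_probe(acts[:800], labels[:800])
preds = (acts[800:] @ w + b) > 0
accuracy = float(np.mean(preds == labels[800:]))
print(f"held-out probe accuracy: {accuracy:.2f}")
```

Because the feature is planted linearly here, the probe should recover it with high held-out accuracy; the interpretability debate is precisely about when such accuracy on a real model reflects a genuine internal world model versus an artifact of the probing method.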
I think LLMs get “world models” (which don’t in fact cover the whole world) in a way that is quite unlike the way intelligent humans form their own world models―and more like how unintelligent or confused humans do the same.
My intuitions about human learning here are very different from yours, I think. In my view, learning (eg) to produce valid sentences in a native language and to understand sentences from other speakers is very nearly the only thing that matters, and that’s something nearly all speakers achieve. Learning an explicit model for that language, in order to eg produce a correct parse tree, matters a tiny bit, very briefly, when you learn parse trees in school. Rather than intelligent humans learning a detailed explicit model of their language and unintelligent humans not doing so, it seems to me that very few intelligent humans have such a model. Mostly it’s just linguists, who need an explicit model. I would further claim that those who do learn an explicit model don’t end up being significantly better at producing and understanding language in their day-to-day lives; it’s not explicit modeling that makes us good at that.
I do agree that someone without an explicit model of a topic will often have a harder time explaining that topic to someone else, and I agree that LLMs typically learn implicit rather than explicit models. I just don’t think that, in and of itself, makes them worse at using those models.
That said, to the extent that by ‘general reasoning’ we mean chains of step-by-step assertions with each step explicitly justified by valid rules of reasoning, that does seem like something that benefits a lot from an explicit model. So in the end I don’t necessarily disagree with your application of this idea to at least some versions of general reasoning; I do disagree when it comes to other sorts of general reasoning, and LLM capabilities in general.