Given that I think LLMs don’t generalize, I was surprised how compelling Aschenbrenner’s case sounded when I read it (well, the first half of it. I’m short on time...). He seemed to have taken all the same evidence I knew about and arranged it into a very different framing. But I also felt like he underweighted criticism from the likes of Gary Marcus. To me, the illusion of LLMs being “smart” has been broken for a year or so.
As someone who has been studying LLM outputs pretty intently since GPT-2, I think you are mostly right but that the details do matter here.
The LLMs give a very good illusion of being smart, but are actually kinda dumb underneath. Yes. But… with each generation they get a little less dumb, a little more able to reason and extrapolate. The difference between ‘bad’ and ‘bad, but not as bad as they used to be, and getting rapidly better’ is pretty important.
They are also bad at ‘integrating’ knowledge. This results in having certain facts memorized, but getting questions where the answer is indicated by those facts wrong when the questions come from an unexpected direction. I haven’t noticed steady progress on factual knowledge integration in the same way I have with reasoning. I do expect this hurdle will be overcome eventually. Things are progressing quite quickly, and I know of many advances which seem like mutually compatible Pareto improvements but which have not yet been integrated into the frontier models because the advances are too new.
Also, I notice that LLMs are getting gradually better at being coding assistants and speeding up my work. So I don’t think it’s necessarily the case that we need to get all the way to full human-level reasoning before we get substantial positive feedback effects on ML algorithm development rate from improved coding assistance.
I’m having trouble discerning a difference between our opinions, as I expect a “kind-of AGI” to come out of LLM tech, given enough investment. Re: code assistants, I’m generally disappointed with GitHub Copilot. It’s not unusual that I’m like “wow, good job”, but bad completions are commonplace, especially when I ask a question in the sidebar (which should use a bigger LLM). Its (very hallucinatory) response typically demonstrates that it doesn’t understand our (relatively small) codebase very well, to the point where I only occasionally bother asking. (I keep wondering “did no one at GitHub think to generate an outline of the app that could fit in the context window?”)
Yes, I agree our views are quite close. My expectations closely match what you say here:
Although LLMs badly suck at reasoning, my AGI timelines are still kinda short―roughly 1 to 15 years for “real” AGI, with quasi-AGI in 2 to 6 years―mainly because so much funding is going into this, and because only one researcher needs to figure out the secret, and because so much research is being shared publicly, and because there should be many ways to do AGI, and because quasi-AGI (if invented first) might help create real AGI.
Basically I just want to point out that the progression of competence in recent models seems pretty impressive, even though the absolute values are low.
For instance, for writing code I think the following pattern of models (including only ones I’ve personally tested enough to have an opinion) shows a clear trend of increasing competence with later release dates:
GitHub Copilot (pre-GPT-4) < GPT-4 (the first release) < Claude 3 Opus < Claude 3.5 Sonnet
Basically, I’m holding in my mind the possibility that the next versions (GPT-5 and/or Claude Opus 4) will really impress me. I don’t feel confident of that. I am pretty confident that the version after next (e.g. GPT-6 / Claude Opus 5) will impress me and actually be useful for RSI (recursive self-improvement).
From this list, Claude 3.5 Sonnet is the first one competent enough that I find it even occasionally useful. I made myself use the others just to get familiar with their abilities, but their outputs just weren’t worth the time and effort on average.