This post taught me a lot about different ways of thinking about timelines, thanks to everyone involved!
I’d like to offer some arguments that, contra Daniel’s view, AI systems are highly unlikely to be able to replace 99% of current fully remote jobs anytime in the next 4 years. As a sample task, I’ll reference software engineering projects that take a reasonably skilled human practitioner one week to complete. I imagine that, for AIs to be ready for 99% of current fully remote jobs, they would need to be able to accomplish such a task. (That specific category might be less than 1% of all remote jobs, but I imagine that the class of remote jobs requiring at least this level of cognitive ability is more than 1%.)
Rather than referencing scaling laws, my arguments stem from analysis of two specific mechanisms which I believe are missing from current LLMs:
Long-term memory. LLMs of course have no native mechanism for retaining new information beyond the scope of their token buffer. I don’t think it is possible to carry out a complex extended task, such as a week-long software engineering project, without long-term memory to manage the task, keep track of intermediate thoughts regarding design approaches, etc.
Iterative / exploratory work processes. The LLM training process focuses on producing final work output in a single pass, with no planning process, design exploration, intermediate drafts, revisions, etc. I don’t think it is possible to accomplish a week-long software engineering task in a single pass; at least, not without very strongly superhuman capabilities (unlikely to be reached in just four years).
Of course there are workarounds for each of these issues, such as RAG for long-term memory, and multi-prompt approaches (chain-of-thought, tree-of-thought, AutoGPT, etc.) for exploratory work processes. But I see no reason to believe that they will work sufficiently well to tackle a week-long project. Briefly, my intuitive argument is that these are old school, rigid, GOFAI, Software 1.0 sorts of approaches, the sort of thing that tends to not work out very well in messy real-world situations. Many people have observed that even in the era of GPT-4, there is a conspicuous lack of LLMs accomplishing any really meaty creative work; I think these missing capabilities lie at the heart of the problem.
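To make concrete the kind of rigid, Software 1.0 scaffold I have in mind, here's a minimal sketch of an AutoGPT-style outer loop with RAG-style memory bolted on. Everything here is a hypothetical stand-in rather than any particular product: `call_llm` is a placeholder for a model API call, and the keyword-overlap retrieval is a toy version of what a real vector store would do.

```python
# Minimal sketch of the "workaround" pattern: hand-written Software 1.0 glue
# wrapped around a stateless LLM. call_llm and SimpleMemory are hypothetical
# placeholders, not real APIs.

from dataclasses import dataclass, field

@dataclass
class SimpleMemory:
    """Toy stand-in for RAG-style long-term memory."""
    notes: list[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.notes.append(note)

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        # A real system would embed and rank; keyword overlap suffices for a sketch.
        return sorted(
            self.notes,
            key=lambda n: len(set(n.lower().split()) & set(query.lower().split())),
            reverse=True,
        )[:k]

def call_llm(prompt: str) -> str:
    # Placeholder for a model API call.
    return "DONE (placeholder response)"

def run_extended_task(task: str, max_steps: int = 1000) -> str:
    """Outer loop: plan, act, record, repeat, all orchestrated by rigid glue code."""
    memory = SimpleMemory()
    memory.add("PLAN: " + call_llm(f"Draft a step-by-step plan for: {task}"))
    for _ in range(max_steps):
        context = "\n".join(memory.retrieve(task))
        action = call_llm(f"Task: {task}\nNotes so far:\n{context}\nWhat is the next action?")
        result = call_llm(f"Carry out this action and report the outcome: {action}")
        memory.add(f"STEP: {action}\nRESULT: {result}")
        if "DONE" in result:  # crude termination check, another bit of brittle glue
            break
    return call_llm(f"Summarize the final deliverable for: {task}")
```

The point of the sketch is that all of the actual judgment – when to revise the plan, what counts as done, which notes matter – lives in this brittle hand-written loop rather than in the model, which is exactly why I don't expect this style of scaffold to hold up over a week-long project.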
Nor do I see how we could expect another round or two of scaling to introduce the missing capabilities. The core problem is that we don’t have massive amounts of training data for managing long-term memory or carrying out exploratory work processes. Generating such data at the necessary scale, if it’s even possible, seems much harder than what we’ve been doing up to this point to marshal training data for LLMs.
The upshot is that I think that we have been seeing the rapid increase in capabilities of generative AI, failing to notice that this progress is confined to a particular subclass of tasks – namely, tasks which can pretty much be accomplished using System 1 alone – and collectively fooling ourselves into thinking that the trend of increasing capabilities is going to quickly roll through the remainder of human capabilities. In other words, I believe the assertion that the recent rate of progress will continue up through AGI is based on an overgeneralization. For an extended version of this claim, see a post I wrote a few months ago: The AI Progress Paradox. I’ve also written at greater length about the issues of Long-term memory and Exploratory work processes.
In the remainder of this comment, I’m going to go through what I believe are some weak points in the argument for short timelines (as presented in the original post).
[Daniel] It seems to me that GPT-4 is already pretty good at coding, and a big part of accelerating AI R&D seems very much in reach—like, it doesn’t seem to me like there is a 10-year, 4-OOM-training-FLOP gap between GPT4 and a system which is basically a remote-working OpenAI engineer that thinks at 10x serial speed.
Coding, in the sense that GPT4 can do it, is nowhere near the top of the hierarchy of skills involved in serious software engineering. And so I believe this is a bit like saying that, because a certain robot is already pretty decent at chiseling, it will soon be able to produce works of art at the same level as any human sculptor.
[Ajeya] I don’t know, 4 OOM is less than two GPTs, so we’re talking less than GPT-6. Given how consistently I’ve been wrong about how well “impressive capabilities in the lab” will translate to “high economic value” since 2020, this seems roughly right to me?
[Daniel] I disagree with this update—I think the update should be “it takes a lot of schlep and time for the kinks to be worked out and for products to find market fit” rather than “the systems aren’t actually capable of this.” Like, I bet if AI progress stopped now, but people continued to make apps and widgets using fine-tunes of various GPTs, there would be OOMs more economic value being produced by AI in 2030 than today.
If the delay in real-world economic value were due to “schlep”, shouldn’t we already see one-off demonstrations of LLMs performing economically-valuable-caliber tasks in the lab? For instance, regarding software engineering, maybe it takes a long time to create a packaged product that can be deployed in the field, absorb the context of a legacy codebase, etc. and perform useful high-level work. But if that’s the only problem, shouldn’t there already be at least one demonstration of an LLM doing some meaty software engineering project in a friendly lab environment somewhere?
More generally, how do we define “schlep” such that the need for schlep explains the lack of visible accomplishments today, but also allows for AI systems to be able to replace 99% of remote jobs within just four years?
[Daniel] And so I think that the AI labs will be using AI remote engineers much sooner than the general economy will be. (Part of my view here is that around the time it is capable of being a remote engineer, the process of working out the kinks / pushing through schlep will itself be largely automatable.)
What is your definition of “schlep”? I’d assumed it referred to the innumerable details of figuring out how to adapt and integrate a raw LLM into a finished product which can handle all of the messy requirements of real-world use cases – the “last mile” of unspoken requirements and funky edge cases. Shouldn’t we expect such things to be rather difficult to automate? Or do you mean something else by “schlep”?
[Daniel] …when I say 2027 as my median, that’s kinda because I can actually quite easily see it happening in 2025, but things take longer than I expect, so I double it.
Can you see LLMs acquiring long-term memory and an expert-level, nuanced ability to carry out extended exploratory processes by 2025? If yes, how do you see that coming about? If no, does that cause you to update at all?
[Daniel] I take it that in this scenario, despite getting IMO gold etc. the systems of 2030 are not able to do the work of today’s OAI engineer? Just clarifying. Can you say more about what goes wrong when you try to use them in such a role?
Anecdote: I got IMO silver (granted, not gold) twice, in my junior and senior years of high school. At that point I had already been programming for close to ten years, and spent considerably more time coding than I spent studying math, but I would not have been much of an asset to an engineering team. I had no concept of how to plan a project, organize a codebase, design maintainable code, strategize a debugging session, evaluate tradeoffs, read between the lines of a poorly written requirements document, etc. Ege described it pretty well:
I think when you try to use the systems in practical situations, they might lose coherence over long chains of thought, or be unable to effectively debug non-performant complex code, or not be able to have as good intuitions about which research directions would be promising, et cetera.
This probably underestimates the degree to which IMO-silver-winning me would have struggled. For instance, I remember really struggling to debug binary tree rotation (a fairly simple bit of data-structure-and-algorithm work) for a college class, almost 2.5 years after my first silver.
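For concreteness, here's roughly the operation I mean – a standard left rotation with parent pointers, written out in Python purely for illustration (the class assignment was long ago and certainly not in Python):

```python
# Illustrative only: a left rotation in a binary search tree with parent pointers.
# A handful of pointer reassignments -- easy to state, surprisingly easy to get
# wrong once parent pointers are involved.

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.parent = None

def rotate_left(root, x):
    """Rotate x's right child y up into x's place; return the (possibly new) root."""
    y = x.right
    x.right = y.left              # y's left subtree becomes x's right subtree
    if y.left is not None:
        y.left.parent = x
    y.parent = x.parent           # splice y into x's former position
    if x.parent is None:
        root = y
    elif x is x.parent.left:
        x.parent.left = y
    else:
        x.parent.right = y
    y.left = x                    # x becomes y's left child
    x.parent = y
    return root
```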
[Ajeya] I think by the time systems are transformative enough to massively accelerate AI R&D, they will still not be that close to savannah-to-boardroom level transfer, but it will be fine because they will be trained on exactly what we wanted them to do for us.
This assumes we’re able to train them on exactly what we want them to do. It’s not obvious to me how we would train a model to do, for example, high-level software engineering. (In any case, I suspect that this is not far off from being AGI-complete; I would suspect the same of high-level work in most fields; see again my earlier-linked post on the skills involved in engineering.)
[Daniel] …here’s a scenario I think it would be productive to discuss:
(1) Q1 2024: A bigger, better model than GPT-4 is released by some lab. It’s multimodal; it can take a screenshot as input and output not just tokens but keystrokes and mouseclicks and images. Just like with GPT-4 vs. GPT-3.5 vs. GPT-3, it turns out to have new emergent capabilities. Everything GPT-4 can do, it can do better, but there are also some qualitatively new things that it can do (though not super reliably) that GPT-4 couldn’t do.
…
(6) Q3 2026: Superintelligent AGI happens, by whatever definition is your favorite. And you see it with your own eyes.
I realize you’re not explicitly labeling this as a prediction, but… isn’t this precisely the sort of thought process to which Hofstadter’s Law applies?
Thanks for this thoughtful and detailed and object-level critique! Just the sort of discussion I hope to inspire. Strong-upvoted.
Here are my point-by-point replies:
Of course there are workarounds for each of these issues, such as RAG for long-term memory, and multi-prompt approaches (chain-of-thought, tree-of-thought, AutoGPT, etc.) for exploratory work processes. But I see no reason to believe that they will work sufficiently well to tackle a week-long project. Briefly, my intuitive argument is that these are old school, rigid, GOFAI, Software 1.0 sorts of approaches, the sort of thing that tends to not work out very well in messy real-world situations. Many people have observed that even in the era of GPT-4, there is a conspicuous lack of LLMs accomplishing any really meaty creative work; I think these missing capabilities lie at the heart of the problem.
I agree that if no progress is made on long-term memory and iterative/exploratory work processes, we won’t have AGI. My position is that we are already seeing significant progress in these dimensions and that we will see more significant progress in the next 1-3 years. (If 4 years from now we haven’t seen such progress I’ll admit I was totally wrong about something). Maybe part of the disagreement between us is that the stuff you think are mere hacky workarounds, I think might work sufficiently well (with a few years of tinkering and experimentation perhaps).
Wanna make some predictions we could bet on? Some AI capability I expect to see in the next 3 years that you expect to not see?
Coding, in the sense that GPT4 can do it, is nowhere near the top of the hierarchy of skills involved in serious software engineering. And so I believe this is a bit like saying that, because a certain robot is already pretty decent at chiseling, it will soon be able to produce works of art at the same level as any human sculptor.
I think I just don’t buy this. I work at OpenAI R&D. I see how the sausage gets made. I’m not saying the whole sausage is coding, I’m saying a significant part of it is, and moreover that many of the bits GPT4 currently can’t do seem to me like they’ll be doable in the next few years.
If the delay in real-world economic value were due to “schlep”, shouldn’t we already see one-off demonstrations of LLMs performing economically-valuable-caliber tasks in the lab? For instance, regarding software engineering, maybe it takes a long time to create a packaged product that can be deployed in the field, absorb the context of a legacy codebase, etc. and perform useful high-level work. But if that’s the only problem, shouldn’t there already be at least one demonstration of an LLM doing some meaty software engineering project in a friendly lab environment somewhere? More generally, how do we define “schlep” such that the need for schlep explains the lack of visible accomplishments today, but also allows for AI systems to be able to replace 99% of remote jobs within just four years?
To be clear, I do NOT think that today’s systems could replace 99% of remote jobs even with a century of schlep. And in particular I don’t think they are capable of massively automating AI R&D even with a century of schlep. I just think they could be producing, say, at least an OOM more economic value. My analogy here is to the internet; my understanding is that there were a bunch of apps that are super big now (amazon? tinder? twitter?) that were technically feasible on the hardware of 2000, but which didn’t just spring into the world fully formed in 2000 -- instead it took time for startups to form, ideas to be built and tested, markets to be disrupted, etc.
I define schlep the same way you do, I think.
What I predict will happen is basically described in the scenario I gave in the OP, though I think it’ll probably take slightly longer than that. I don’t want to go into much detail, I’m afraid, because it might give the impression that I’m leaking OAI secrets (even though, to be clear, I’ve had these views since before I joined OAI).
I think when you try to use the systems in practical situations, they might lose coherence over long chains of thought, or be unable to effectively debug non-performant complex code, or not be able to have as good intuitions about which research directions would be promising, et cetera.
This was a nice answer from Ege. My follow up questions would be: Why? I have theories about what coherence is and why current models often lose it over long chains of thought (spoiler: they weren’t trained to have trains of thought) and theories about why they aren’t already excellent complex-code-debuggers (spoiler: they weren’t trained to be) etc. What’s your theory for why all the things AI labs will try between now and 2030 to make AIs good at these things will fail? Base models (gpt-3, gpt-4, etc.) aren’t out-of-the-box good at being helpful harmless chatbots or useful coding assistants. But with a bunch of tinkering and RLHF, they became good, and now they are used in the real world by a hundred million people a day. Again though I don’t want to get into details. I understand you might be skeptical that it can be done but I encourage you to red-team your position, and ask yourself ‘how would I do it, if I were an AI lab hell-bent on winning the AGI race?’ You might be able to think of some things. And if you can’t, I’d love to hear your thoughts on why it’s not possible. You might be right.
I realize you’re not explicitly labeling this as a prediction, but… isn’t this precisely the sort of thought process to which Hofstadter’s Law applies?
Indeed. Like I said, my timelines are based on a portfolio of different models/worlds; the very short-timelines models/worlds are basically like “look we basically already have the ingredients, we just need to assemble them, here is how to do it...” and the planning fallacy / hofstadter’s law 100% applies to this. The 5-year-and-beyond worlds are not like that; they are more like extrapolating trends and saying “sure looks like by 2030 we’ll have AIs that are superhuman at X, Y, Z, … heck all of our current benchmarks. And because of the way generalization/transfer/etc. and ML works they’ll probably also be broadly capable at stuff, not just narrowly good at these benchmarks. Hmmm. Seems like that could be AGI.” Note the absence of a plan here, I’m just looking at lines on graphs and then extrapolating them and then trying to visualize what the absurdly high values on those graphs mean for fuzzier stuff that isn’t being measured yet.
So my timelines do indeed take into account Hofstadter’s Law. If I wasn’t accounting for it already, my median would be lower than 2027. However, I am open to the criticism that maybe I am not accounting for it enough. However I am NOT open to the criticism that I should e.g. add 10 years to my timelines because of this. For reasons just explained. It’s a sort of “double or triple how long you think it’ll take to complete the plan” sort of thing, not a “10x how long you think it’ll take to complete the plan” sort of thing, and even if it was, then I’d just ditch the plan and look at the graphs.
Likewise, thanks for the thoughtful and detailed response. (And I hope you aren’t too impacted by current events...)
I agree that if no progress is made on long-term memory and iterative/exploratory work processes, we won’t have AGI. My position is that we are already seeing significant progress in these dimensions and that we will see more significant progress in the next 1-3 years. (If 4 years from now we haven’t seen such progress I’ll admit I was totally wrong about something). Maybe part of the disagreement between us is that the stuff you think are mere hacky workarounds, I think might work sufficiently well (with a few years of tinkering and experimentation perhaps).
Wanna make some predictions we could bet on? Some AI capability I expect to see in the next 3 years that you expect to not see?
Sure, that’d be fun, and seems like about the only reasonable next step on this branch of the conversation. Setting good prediction targets is difficult, and as it happens I just blogged about this. Off the top of my head, predictions could be around the ability of a coding AI to work independently over an extended period of time (at which point, it is arguably an “engineering AI”). Two different ways of framing it:
(1) An AI coding assistant can independently complete 80% of real-world tasks that would take X amount of time for a reasonably skilled engineer who is already familiar with the general subject matter and the project/codebase to which the task applies.
(2) An AI coding assistant can usefully operate independently for X amount of time, i.e. it is often productive to assign it a task and allow it to process for X time before checking in on it.
At first glance, (1) strikes me as a better, less-ambiguous framing. Of course it becomes dramatically more or less ambitious depending on X; the 80% could also be tweaked, but I think that is less interesting (low percentages allow a fluky, unreliable AI to pass the test; very high percentages seem likely to require superhuman performance in a way that is not relevant to what we’re trying to measure here).
It would be nice to have some prediction targets that more directly get at long-term memory and iterative/exploratory work processes, but as I discuss in the blog post, I don’t know how to construct such a target – open to suggestions.
Coding, in the sense that GPT4 can do it, is nowhere near the top of the hierarchy of skills involved in serious software engineering. And so I believe this is a bit like saying that, because a certain robot is already pretty decent at chiseling, it will soon be able to produce works of art at the same level as any human sculptor.
I think I just don’t buy this. I work at OpenAI R&D. I see how the sausage gets made. I’m not saying the whole sausage is coding, I’m saying a significant part of it is, and moreover that many of the bits GPT4 currently can’t do seem to me like they’ll be doable in the next few years.
Intuitively, I struggle with this, but you have inside data and I do not. Maybe we just set this point aside for now; we have plenty of other points we can discuss.
To be clear, I do NOT think that today’s systems could replace 99% of remote jobs even with a century of schlep. And in particular I don’t think they are capable of massively automating AI R&D even with a century of schlep. I just think they could be producing, say, at least an OOM more economic value. …
This, I would agree with. And on re-reading, I think I may have been mixed up as to what you and Ajeya were saying in the section I was quoting from here, so I’ll drop this.
[Ege] I think when you try to use the systems in practical situations, they might lose coherence over long chains of thought, or be unable to effectively debug non-performant complex code, or not be able to have as good intuitions about which research directions would be promising, et cetera.
This was a nice answer from Ege. My follow up questions would be: Why? I have theories about what coherence is and why current models often lose it over long chains of thought (spoiler: they weren’t trained to have trains of thought) and theories about why they aren’t already excellent complex-code-debuggers (spoiler: they weren’t trained to be) etc. What’s your theory for why all the things AI labs will try between now and 2030 to make AIs good at these things will fail?
I would not confidently argue that it won’t happen by 2030; I am suggesting that these problems are unlikely to be well solved in a usable-in-the-field form by 2027 (four years from now). My thinking:
(1) The rapid progress in LLM capabilities has been substantially fueled by the availability of stupendous amounts of training data.
(2) There is no similar abundance of low-hanging training data for extended (day/week/more) chains of thought, nor for complex debugging tasks. Hence, it will not be easy to extend LLMs (and/or train some non-LLM model) to high performance at these tasks.
(3) A lot of energy will go into the attempt, which will eventually succeed. But per (2), I think some new techniques will be needed, which will take time to identify, refine, scale, and productize; a heavy lift in four years. (Basically: Hofstadter’s Law.)
Especially because I wouldn’t be surprised if complex-code-debugging turns out to be essentially “AGI-complete”, i.e. it may require a sufficiently varied mix of exploration, logical reasoning, code analysis, etc. that you pretty much have to be a general AGI to be able to do it well.
I understand you might be skeptical that it can be done but I encourage you to red-team your position, and ask yourself ‘how would I do it, if I were an AI lab hell-bent on winning the AGI race?’ You might be able to think of some things.
In a nearby universe, I would be fundraising for a startup to do exactly that; it sounds like a hell of a fun problem. :-) And I’m sure you’re right… I just wouldn’t expect to get to “capable of 99% of all remote work” within four years.
I realize you’re not explicitly labeling this as a prediction, but… isn’t this precisely the sort of thought process to which Hofstadter’s Law applies?
Indeed. Like I said, my timelines are based on a portfolio of different models/worlds; the very short-timelines models/worlds are basically like “look we basically already have the ingredients, we just need to assemble them, here is how to do it...” and the planning fallacy / hofstadter’s law 100% applies to this. The 5-year-and-beyond worlds are not like that; they are … looking at lines on graphs and then extrapolating them …
So my timelines do indeed take into account Hofstadter’s Law. If I wasn’t accounting for it already, my median would be lower than 2027. However, I am open to the criticism that maybe I am not accounting for it enough.
To be clear, I’m only attempting to argue about the short-timeline worlds. I agree that Hofstadter’s Law doesn’t apply to curve extrapolation. (My intuition for 5-year-and-beyond worlds is more like Ege’s, but I have nothing coherent to add to the discussion on that front.) And so, yes, I think my position boils down to “I believe that, in your short-timeline worlds, you are not accounting for Hofstadter’s Law enough”.
As you proposed, I think the interesting place to go from here would be some predictions. I’ll noodle on this, and I’d be very interested to hear any thoughts you have – milestones along the path you envision in your default model of what rapid progress looks like; or at least, whatever implications thereof you feel comfortable talking about.
Oooh, I should have thought to ask you this earlier—what numbers/credences would you give for the stages in my scenario sketched in the OP? This might help narrow things down. My guess based on what you’ve said is that the biggest update for you would be Step 2, because that’s when it’s clear we have a working method for training LLMs to be continuously-running agents—i.e. long-term memory and continuous/exploratory work processes.
The timelines-relevant milestone for AGI is the ability to do research autonomously, especially AIs’ ability to develop AIs that don’t have particular cognitive limitations compared to humans. Quickly giving AIs experience at particular jobs/tasks, the kind of experience that doesn’t follow from general intelligence alone, is probably possible through learning in parallel or through AIs experimenting at greater serial speed than humans can. Getting that kind of thing into AIs is the schlep that possibly stands in the way of reaching AGI (even after future scaling), and it has to be done by humans. But reaching AGI also doesn’t require overcoming all of the important cognitive shortcomings of AIs relative to humans, only those that completely prevent AIs from quickly researching their way past the remaining shortcomings on their own.
It’s currently unclear whether merely scaling GPTs (multimodal LLMs), with just a bit more schlep/scaffolding, won’t produce a weirdly disabled general intelligence (incapable of replacing even 50% of current fully remote jobs at a reasonable cost, or at all) that is nonetheless capable enough to fix its disabilities shortly thereafter, making use of its ability to batch-develop such fixes much faster than humans would, even if it does so in some sense monstrously inefficiently and takes another couple of giant training runs (from when it starts) to get there. This will become clearer in a few years, once feasible scaling of base GPTs is mostly done, but we are not there yet.
More generally, how do we define “schlep” such that the need for schlep explains the lack of visible accomplishments today, but also allows for AI systems to be able to replace 99% of remote jobs within just four years?
I think a lot of the forecasted schlep is not commercialization, but research and development to get working prototypes. It can be that there are no major ideas that you need to find, but that your current versions don’t really work because of a ton of finicky details that you haven’t optimized yet. But when you do, your system will basically work.