Here’s the structure of the argument I find most compelling (I call it the benchmarks + gaps argument), though I’m uncertain about the details.
Focus on the endpoint of substantially speeding up AI R&D / automating research engineering. Let’s define our timelines endpoint as something that ~5xs the rate of AI R&D algorithmic progress (compared to a counterfactual world with no post-2024 AIs). Then make an argument that ~fully automating research engineering (experiment implementation/monitoring) would do this, along with research taste of at least the 50th percentile AGI company researcher (experiment ideation/selection).
Focus on RE-bench since it’s the most relevant benchmark. For simplicity I’ll focus only on this, though for robustness more benchmarks should be considered.
Based on trend extrapolation and benchmark base rates, there’s roughly a 50% chance we’ll saturate RE-bench by end of 2025.
Identify the most important gaps between saturating RE-bench and the endpoint defined in (1). The most important gaps between saturating RE-bench and achieving the 5xing of AI R&D algorithmic progress are: (a) time horizon as measured by human time spent, (b) tasks with worse feedback loops, (c) tasks with large codebases, and (d) becoming significantly cheaper and/or faster than humans. There are more, but my best guess is that these four are the most important; unknown gaps should also be taken into account.
When forecasting the time to cross the gaps, it seems quite plausible that we get to the substantial AI R&D speedup within a few years after saturating RE-bench, so by end of 2028 (and significantly earlier doesn’t seem crazy).
This is the most important part of the argument, and one that I have lots of uncertainty over. We have some data regarding the “crossing speed” of some of the gaps but the data are quite limited at the moment. So there are a lot of judgment calls needed and people with strong long timelines intuitions might think the remaining gaps will take a long time to cross without this being close to falsified by our data.
This is broken down into “time to cross the gaps at 2024 pace of progress” → adjusting based on compute forecasts and intermediate AI R&D speedups before reaching 5x.
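That two-step breakdown can be sketched concretely. The sketch below is purely illustrative: the `ramp` function and the 6-year gap size are hypothetical assumptions of mine, not estimates from the argument above. It shows the mechanical part only: converting "years of work at 2024 pace" into calendar years under an assumed ramp of intermediate AI R&D speedups.

```python
# Illustrative sketch of the "benchmarks + gaps" adjustment step:
# take a gap-crossing requirement measured in years of 2024-pace progress,
# then integrate it under an assumed ramp of intermediate AI R&D speedups.
# All numbers below are made-up assumptions for illustration.

def calendar_years_to_cross(gap_years_at_2024_pace, speedup_at_year):
    """Step month by month, accumulating progress at the current speedup,
    until the gap (measured in 2024-pace years) is crossed."""
    progress, t, dt = 0.0, 0.0, 1.0 / 12
    while progress < gap_years_at_2024_pace:
        progress += speedup_at_year(t) * dt
        t += dt
    return t

# Assumption: speedup ramps linearly from 1x now to 5x after 4 years.
ramp = lambda t: 1 + t
print(calendar_years_to_cross(6.0, ramp))  # noticeably less than 6 years
```

The point of the toy model is just that intermediate speedups compress the calendar time relative to a naive "years of work remaining" estimate.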
From substantial AI R&D speedup to AGI. Once we have the 5xing AIs, that’s potentially already AGI by some definitions; if you have a stronger definition, the possibility of a somewhat fast takeoff means you might get it within a year or so after.
One reason I like this argument is that it will get much stronger over time as we get more difficult benchmarks and otherwise get more data about how quickly the gaps are being crossed.
I have a longer draft which makes this argument but it’s quite messy and incomplete and might not add much on top of the above summary for now. Unfortunately I’m prioritizing other workstreams over finishing this at the moment. DM me if you’d really like a link to the messy draft.
RE-bench tasks (see page 7 here) are not the kind of AI research where you’re developing new AI paradigms and concepts. The tasks are much more straightforward than that. So your argument is basically assuming without argument that we can get to AGI with just the more straightforward stuff, as opposed to new AI paradigms and concepts.
If we do need new AI paradigms and concepts to get to AGI, then there would be a chicken-and-egg problem in automating AI research. Or more specifically, there would be two categories of AI R&D, with the less important category (e.g. performance optimization and other RE-bench-type tasks) being automatable by near-future AIs, and the more important category (developing new AI paradigms and concepts) not being automatable.
(Obviously you’re entitled to argue / believe that we don’t need new AI paradigms and concepts to get to AGI! It’s a topic where I think reasonable people disagree. I’m just suggesting that it’s a necessary assumption for your argument to hang together, right?)
I disagree. I think the existing body of published computer science and neuroscience research is chock full of loose threads. Tons of potential innovations just waiting to be harvested by automated researchers. I’ve mentioned this idea elsewhere. I call it an ‘innovation overhang’.
Simply testing interpolations and extrapolations (e.g. scaling up old forgotten ideas on modern hardware) seems highly likely to reveal plenty of successful new concepts, even if the hit rate per attempt is low.
I think this means a better benchmark would consist of: taking two existing papers, finding a plausible hypothesis which combines their assumptions, designing, coding, and running tests, then reporting on the results.
So I don’t think “no new concepts” is a necessary assumption for getting to AGI quickly with the help of automated researchers.
Simply testing interpolations and extrapolations (e.g. scaling up old forgotten ideas on modern hardware) seems highly likely to reveal plenty of successful new concepts, even if the hit rate per attempt is low
Is this bottlenecked by programmer time or by compute cost?
Both? If you increase only one of the two the other becomes the bottleneck?
I agree this means that devoting substantial compute both to inference and to running experiments designed by AI researchers is a large cost. Presumably, as the competence of the AI researchers gets higher, it becomes easier to trust them not to waste their assigned experiment compute.
There was discussion on Dwarkesh Patel’s interview with researcher friends, where it was mentioned that AI researchers are already restricted by the compute granted to them for experiments. Probably also by the work hours per week they are allowed to spend on novel “off the main path” research.
So in order for there to be a big surge in AI R&D there’d need to be prioritization of that at a high level. This would be a change of direction from focusing primarily on scaling current techniques rapidly, and putting out slightly better products ASAP.
So yes, if you think that this priority shift won’t happen, then you should doubt that the increase in R&D speed my model predicts will occur.
But what would that world look like? Probably a world where scaling continues to pay dividends, and getting to AGI is more straightforward than Steve Byrnes or I expect.
I agree that that’s a substantial probability, but it’s also an AGI-soon sort of world.
I argue that for AGI to be not-soon, you need both scaling and algorithmic research to fail.
Both? If you increase only one of the two the other becomes the bottleneck?
My impression based on talking to people at labs plus stuff I’ve read is that
Most AI researchers have no trouble coming up with useful ways of spending all of the compute available to them
Most of the expense of hiring AI researchers is the compute cost of their experiments rather than salary
The big scaling labs try their best to hire the very best people they can get their hands on and concentrate their resources heavily into just a few teams, rather than trying to hire everyone with a pulse who can rub two tensors together.
(Very open to correction by people closer to the big scaling labs).
My model, then, says that compute availability is a constraint that binds much harder than programming or research ability, at least as things stand right now.
There was discussion on Dwarkesh Patel’s interview with researcher friends, where it was mentioned that AI researchers are already restricted by the compute granted to them for experiments. Probably also by the work hours per week they are allowed to spend on novel “off the main path” research.
Sounds plausible to me. Especially since benchmarks encourage a focus on ability to hit the target at all rather than ability to either succeed or fail cheaply, which is what’s important in domains where the salary / electric bill of the experiment designer is an insignificant fraction of the total cost of the experiment.
But what would that world look like? [...] I agree that that’s a substantial probability, but it’s also an AGI-soon sort of world.
Yeah, I expect it’s a matter of “dumb” scaling plus experimentation rather than any major new insights being needed. If scaling hits a wall that training on generated data + fine tuning + routing + specialization can’t overcome, I do agree that innovation becomes more important than iteration.
My model is not just “AGI-soon” but “the more permissive thresholds for when something should be considered AGI have already been met, and more such thresholds will fall in short order, and so we should stop asking when we will get AGI and start asking about when we will see each of the phenomena that we are using AGI as a proxy for”.
I think you’re mostly correct about current AI researchers being able to usefully experiment with all the compute they have available.
I do think there are some considerations here though.
How closely are they adhering to the “main path” of scaling existing techniques with minor tweaks? If you want to know how a minor tweak affects your current large model at scale, that is a very compute-heavy, researcher-time-light type of experiment. On the other hand, if you want to test many novel paths at much smaller scales, then you are in a relatively compute-light but researcher-time-heavy regime.
What fraction of the available compute resources is the company assigning to each of training/inference/experiments? My guess is that the current split is somewhere around 63/33/4. If this were true, and the company decided to pivot away from training to focus on experiments (0/33/67), this would be something like a 16x increase in compute for experiments. So maybe that changes the bottleneck?
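The arithmetic behind that guess can be made explicit. Note that the 63/33/4 split is my hypothetical estimate from above, not measured data:

```python
# Guessed compute split (percent of total) across
# training / inference / experiments -- hypothetical numbers, not data.
current = {"training": 63, "inference": 33, "experiments": 4}
pivoted = {"training": 0, "inference": 33, "experiments": 67}

# Ratio of experiment compute after the pivot to experiment compute now.
multiplier = pivoted["experiments"] / current["experiments"]
print(f"experiment compute multiplier: ~{multiplier:.2f}x")  # ~16.75x
```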
We do indeed seem to be at “AGI for most stuff”, but with a spiky envelope of capability that leaves some dramatic failure modes. So it does make more sense to ask something like, “For remaining specific weakness X, what will the research agenda and timeline look like?”
This makes more sense than continuing to ask the vague “AGI complete” question when we are most of the way there already.
It sounds like your disagreement isn’t with drawing a link from RE-bench to (forecasts for) automating research engineering, but is instead with thinking that you can get AGI shortly after automating research engineering due to AI R&D acceleration and already being pretty close. Is that right?
Note that the comment says research engineering, not research scientists.
For context, in a sibling comment Ryan said (and Steven agreed with):
It sounds like your disagreement isn’t with drawing a link from RE-bench to (forecasts for) automating research engineering, but is instead with thinking that you can get AGI shortly after automating research engineering due to AI R&D acceleration and already being pretty close. Is that right?
Note that the comment says research engineering, not research scientists.
Now responding on whether I think the no new paradigms assumption is needed:
(Obviously you’re entitled to argue / believe that we don’t need need new AI paradigms and concepts to get to AGI! It’s a topic where I think reasonable people disagree. I’m just suggesting that it’s a necessary assumption for your argument to hang together, right?)
I generally have not been thinking in these sorts of binary terms but instead thinking in terms more like “Algorithmic progress research is moving at pace X today, if we had automated research engineers it would be sped up to N*X.” I’m not necessarily taking a stand on whether the progress will involve new paradigms or not, so I don’t think it requires an assumption of no new paradigms.
However:
If you think almost all new progress will in some important sense come from paradigm shifts, the forecasting method becomes weaker, because incremental progress doesn’t say as much about progress toward automated research engineering or AGI.
You might think that it’s more confusing than clarifying to think in terms of collapsing all research progress into a single “speed” and forecasting based on that.
Requiring a paradigm shift might lead to placing less weight on low amounts of research effort being sufficient; and even if the probability distribution over timelines is the same, what we should expect to see in the world leading up to AGI is not.
I’d also add that:
Regarding what research tasks I’m forecasting for the automated research engineer: RE-bench is not supposed to fully represent the tasks involved in actual research engineering. That’s why we have the gaps.
Regarding to what extent having an automated research engineer would speed up progress in worlds in which we need a paradigm shift: I think it’s hard to separate out conceptual from engineering/empirical work in terms of progress toward new paradigms. My guess would be being able to implement experiments very cheaply would substantially increase the expected number of paradigm shifts per unit time.
In the framework of the argument, you seem to be objecting to premises 4-6. Specifically you seem to be saying “There’s another important gap between RE-bench saturation and completely automating AI R&D: new-paradigm-and-concept-generation. Perhaps we can speed up AI R&D by 5x or so without crossing this gap, simply by automating engineering, but to get to AGI we’ll need to cross this gap, and this gap might take a long time to cross even at 5x speed.”
(Is this a fair summary?)
If that’s what you are saying, I think I’d reply:
We already have a list of potential gaps, and this one seems to be a mediocre addition to the list IMO. I don’t think this distinction between old-paradigm/old-concepts and new-paradigm/new-concepts is going to hold up very well to philosophical inspection or continued ML progress; it smells similar to ye olde “do LLMs truly understand, or are they merely stochastic parrots?” and “Can they extrapolate, or do they merely interpolate?”
That said, I do think it’s worthy of being included on the list. I’m just not as excited about it as the other entries, especially (a) and (b).
I’d also say: What makes you think that this gap will take years to cross even at 5x speed? (i.e. even when algorithmic progress is 5x faster than it has been for the past decade) Do you have a positive argument, or is it just generic uncertainty / absence-of-evidence?
(For context: I work in the same org as Eli and basically agree with his argument above)
I think I’m objecting to (as Eli wrote) “collapsing all [AI] research progress into a single “speed” and forecasting based on that”. There can be different types of AI R&D, and we might be able to speed up some types without speeding up other types. For example, coming up with the AlphaGo paradigm (self-play, MCTS, ConvNets, etc.) or LLM paradigm (self-supervised pretraining, Transformers, etc.) is more foundational, whereas efficiently implementing and debugging a plan is less foundational. (Kinda “science vs engineering”?) I also sometimes use the example of Judea Pearl coming up with the belief prop algorithm in 1982. If everyone had tons of compute and automated research engineer assistants, would we have gotten belief prop earlier? I’m skeptical. As far as I understand: Belief prop was not waiting on compute. You can do belief prop on a 1960s mainframe. Heck, you can do belief prop on an abacus. Social scientists have been collecting data since the 1800s, and I imagine that belief prop would have been useful for analyzing at least some of that data, if only someone had invented it.
Indeed. Not only could belief prop have been invented in 1960, it was invented around 1960 (published 1962: “Low-Density Parity-Check Codes”, IRE Transactions on Information Theory) by Robert Gallager, as a decoding algorithm for error-correcting codes.
I recognized that Gallager’s method was the same as Pearl’s belief propagation in 1996 (MacKay and Neal, “Near Shannon limit performance of low density parity check codes”, Electronics Letters, vol. 33, pp. 457–458).
This says something about the ability of AI to potentially speed up research by simply linking known ideas (even if it’s not really AGI).
Came here to say this, got beaten to it by Radford Neal himself, wow! Well, I’m gonna comment anyway, even though it’s mostly been said.
Gallager proposed belief propagation as an approximate good-enough method of decoding a certain error-correcting code, but didn’t notice that it worked on all sorts of probability problems. Pearl proposed it as a general mechanism for dealing with probability problems, but wanted perfect mathematical correctness, so confined himself to tree-shaped problems. It was their common generalization that was the real breakthrough: an approximate good-enough solution to all sorts of problems. Which is what Pearl eventually noticed, so props to him.
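Since the subthread turns on how computationally simple belief propagation really is, here is a minimal sum-product sketch on a three-variable chain. The potentials are made-up numbers purely for illustration; the point is that nothing here requires more than 1960s-level arithmetic:

```python
# Minimal sum-product belief propagation on a chain of three binary
# variables x0 - x1 - x2. Potentials are arbitrary illustrative numbers.
unary = [[0.6, 0.4], [0.5, 0.5], [0.2, 0.8]]   # per-variable potentials
pair = [[0.9, 0.1], [0.1, 0.9]]                 # same potential on both edges

n = len(unary)
fwd = [[1.0, 1.0] for _ in range(n)]  # messages passed left-to-right
bwd = [[1.0, 1.0] for _ in range(n)]  # messages passed right-to-left
for i in range(1, n):
    fwd[i] = [sum(pair[a][b] * unary[i - 1][a] * fwd[i - 1][a]
                  for a in (0, 1)) for b in (0, 1)]
for i in range(n - 2, -1, -1):
    bwd[i] = [sum(pair[a][b] * unary[i + 1][b] * bwd[i + 1][b]
                  for b in (0, 1)) for a in (0, 1)]

# Belief at each node: local potential times incoming messages, normalized.
marginals = []
for i in range(n):
    belief = [unary[i][x] * fwd[i][x] * bwd[i][x] for x in (0, 1)]
    z = sum(belief)
    marginals.append([v / z for v in belief])
```

On a tree (here, a chain) these beliefs are the exact marginals, which can be checked against brute-force enumeration of all eight configurations.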
If we’d had AGI in the 1960s, someone with a probability problem could have said “Here’s my problem. For every paper in the literature, spawn an instance to read that paper and tell me if it has any help for my problem.” It would have found Gallager’s paper and said “Maybe you could use this?”
I think I’m objecting to (as Eli wrote) “collapsing all [AI] research progress into a single “speed” and forecasting based on that”. There can be different types of AI R&D, and we might be able to speed up some types without speeding up other types.
…is parallel to what we see in other kinds of automation.
The technology of today has been much better at automating the production of clocks than the production of haircuts. Thus, 2024 technology is great at automating the production of some physical things but only slightly helpful for automating the production of some other physical things.
By the same token, different AI R&D projects are trying to “produce” different types of IP. Thus, it’s similarly possible that 2029 AI technology will be great at automating the production of some types of AI-related IP but only slightly helpful for automating the production of some other types of AI-related IP.
I disagree that there is a difference of kind between “engineering ingenuity” and “scientific discovery”, at least in the business of AI. The examples you give—self-play, MCTS, ConvNets—were all used in game-playing programs before AlphaGo. The trick of AlphaGo was to combine them, and then discover that it worked astonishingly well. It was very clever and tasteful engineering to combine them, but only a breakthrough in retrospect. And the people that developed them each earlier, for their independent purposes? They were part of the ordinary cycle of engineering development: “Look at a problem, think as hard as you can, come up with something, try it, publish the results.” They’re just the ones you remember, because they were good.
Paradigm shifts do happen, but I don’t think we need them between here and AGI.
Yeah I’m definitely describing something as a binary when it’s really a spectrum. (I was oversimplifying since I didn’t think it mattered for that particular context.)
In the context of AI, I don’t know what the difference is (if any) between engineering and science. You’re right that I was off-base there…
…But I do think that there’s a spectrum from ingenuity / insight to grunt-work.
So I’m bringing up a possible scenario where near-future AI gets progressively less useful as you move towards the ingenuity side of that spectrum, and where changing that situation (i.e., automating ingenuity) itself requires a lot of ingenuity, posing a chicken-and-egg problem / bottleneck that limits the scope of rapid near-future recursive AI progress.
Paradigm shifts do happen, but I don’t think we need them between here and AGI.
I certainly agree that the collapse is a lossy abstraction / simplifies; in reality some domains of research will speed up more than 5x and others less than 5x, for example, even if we did get automated research engineers dropped on our heads tomorrow. Are you therefore arguing that in particular, the research needed to get to AGI is of the kind that won’t be sped up significantly? What’s the argument—that we need a new paradigm to get to AIs that can generate new paradigms, and being able to code really fast and well won’t majorly help us think of new paradigms? (I’d disagree with both sub-claims of that claim)
Are you therefore arguing that in particular, the research needed to get to AGI is of the kind that won’t be sped up significantly? What’s the argument—that we need a new paradigm to get to AIs that can generate new paradigms, and being able to code really fast and well won’t majorly help us think of new paradigms? (I’d disagree with both sub-claims of that claim)
Yup! Although I’d say I’m “bringing up a possibility” rather than “arguing” in this particular thread. And I guess it depends on where we draw the line between “majorly” and “minorly” :)
This is clarifying for me, appreciate it. If I believed (a) that we needed a paradigm shift like the ones to LLMs in order to get AI systems resulting in substantial AI R&D speedup, and (b) that trend extrapolation from benchmark data would not be informative for predicting these paradigm shifts, then I would agree that the benchmarks + gaps method is not particularly informative.
Do you think that’s a fair summary of (this particular set of) necessary conditions?
(edit: didn’t see @Daniel Kokotajlo’s new comment before mine. I agree with him regarding disagreeing with both sub-claims but I think I have a sense of where you’re coming from.)
I don’t think this distinction between old-paradigm/old-concepts and new-paradigm/new-concepts is going to hold up very well to philosophical inspection or continued ML progress; it smells similar to ye olde “do LLMs truly understand, or are they merely stochastic parrots?” and “Can they extrapolate, or do they merely interpolate?”
I find this kind of pattern-match pretty unconvincing without more object-level explanation. Why exactly do you think this distinction isn’t important? (I’m also not sure “Can they extrapolate, or do they merely interpolate?” qualifies as “ye olde,” still seems like a good question to me at least w.r.t. sufficiently out-of-distribution extrapolation.)
We are at an impasse then; I think basically I’m just the mirror of you. To me, the burden is on whoever thinks the distinction is important to explain why it matters. Current LLMs do many amazing things that many people—including AI experts—thought LLMs could never do due to architectural limitations. Recent history is full of examples of AI experts saying “LLMs are the offramp to AGI; they cannot do X; we need new paradigm to do X” and then a year or two later LLMs are doing X. So now I’m skeptical and would ask questions like: “Can you say more about this distinction—is it a binary, or a dimension? If it’s a dimension, how can we measure progress along it, and are we sure there hasn’t been significant progress on it already in the last few years, within the current paradigm? If there has indeed been no significant progress (as with ARC-AGI until 2024) is there another explanation for why that might be, besides your favored one (that your distinction is super important and that because of it a new paradigm is needed to get to AGI)”
And I think you’re admitting that your argument is “if we mush all capabilities together into one dimension, AI is moving up on that one dimension, so things will keep going up”.
Would you say the same thing about the invention of search engines? That was a huge jump in the capability of our computers. And it looks even more impressive if you blur out your vision—pretend you don’t know that the text that comes up on your screen is written by a human, and pretend you don’t know that search is a specific kind of task distinct from a lot of other activity that would be involved in “True Understanding, woooo”—and just say “wow! previously our computers couldn’t write a poem, but now with just a few keystrokes my computer can literally produce Billy Collins level poetry!”.
Blurring things together at that level works for, like, macroeconomic trends. But if you look at macroeconomic trends it doesn’t say singularity in 2 years! Going to 2 or 10 years is an inside-view thing to conclude! You’re making some inference like “there’s an engine that is very likely operating here, that takes us to AGI in xyz years”.
I’m not saying that. You are the one who introduced the concept of “the core algorithms for intelligence;” you should explain what that means and why it’s a binary (or if it’s not a binary but rather a dimension, why we haven’t been moving along that dimension in the recent past).
ETA: I do have an ontology, a way of thinking about these things, that is more sophisticated than simply mushing all capabilities together into one dimension. I just don’t accept your ontology yet.
(I might misunderstand you. My impression was that you’re saying it’s valid to extrapolate from “model XYZ does well at RE-Bench” to “model XYZ does well at developing new paradigms and concepts.” But maybe you’re saying that the trend of LLM success at various things suggests we don’t need new paradigms and concepts to get AGI in the first place? My reply below assumes the former:)
I’m not saying LLMs can’t develop new paradigms and concepts, though. The original claim you were responding to was that success at RE-Bench in particular doesn’t tell us much about success at developing new paradigms and concepts. “LLMs have done various things some people didn’t expect them to be able to do” doesn’t strike me as much of an argument against that.
More broadly, re: your burden of proof claim, I don’t buy that “LLMs have done various things some people didn’t expect them to be able to do” determinately pins down an extrapolation to “the current paradigm(s) will suffice for AGI, within 2-3 years.” That’s not a privileged reference class forecast, it’s a fairly specific prediction.
I feel like this sub-thread is going in circles; perhaps we should go back to the start of it. I said:
I don’t think this distinction between old-paradigm/old-concepts and new-paradigm/new-concepts is going to hold up very well to philosophical inspection or continued ML progress; it smells similar to ye olde “do LLMs truly understand, or are they merely stochastic parrots?” and “Can they extrapolate, or do they merely interpolate?”
You replied:
I find this kind of pattern-match pretty unconvincing without more object-level explanation. Why exactly do you think this distinction isn’t important? (I’m also not sure “Can they extrapolate, or do they merely interpolate?” qualifies as “ye olde,” still seems like a good question to me at least w.r.t. sufficiently out-of-distribution extrapolation.)
Now, elsewhere in this comment section, various people (Carl, Radford) have jumped in to say the sorts of object-level things I also would have said if I were going to get into it. E.g. that old vs. new paradigm isn’t a binary but a spectrum, that automating research engineering WOULD actually speed up new-paradigm discovery, etc. What do you think of the points they made?
Also, I’m still waiting to hear answers to these questions: “Can you say more about this distinction—is it a binary, or a dimension? If it’s a dimension, how can we measure progress along it, and are we sure there hasn’t been significant progress on it already in the last few years, within the current paradigm? If there has indeed been no significant progress (as with ARC-AGI until 2024) is there another explanation for why that might be, besides your favored one (that your distinction is super important and that because of it a new paradigm is needed to get to AGI)”
@elifland what do you think is the strongest argument for long(er) timelines? Do you think it’s essentially just “it takes a long time for researchers to learn how to cross the gaps”?
Or do you think there’s an entirely different frame (something that’s in an ontology that just looks very different from the one presented in the “benchmarks + gaps argument”?)
A few possible categories of situations we might have long timelines, off the top of my head:
Benchmarks + gaps is still best: the overall gap is somewhat larger + compute doubling slows down after 2028, but trend extrapolations still tell us something about gap trends. This is how I would most naturally think about how timelines through maybe the 2030s are achieved, and potentially beyond if neither of the next two holds.
Others are best (more than one of these can be true):
The current benchmarks and evaluations are so far away from AGI that trends on them don’t tell us anything (including regarding how fast gaps might be crossed). In this case one might want to identify the 1-2 most important gaps and reason about when we will cross these based on gears-level reasoning or trend extrapolation/forecasting on “real-world” data (e.g. revenue?) rather than trend extrapolation on benchmarks. Example candidate “gaps” that I often hear for these sorts of cases are the lack of feedback loops and the “long-tail of tasks” / reliability.
A paradigm shift in AGI training is needed and benchmark trends don’t tell us much about when we will achieve this (this is basically Steven’s sibling comment): in this case the best analysis might involve looking at the base rate of paradigm shifts per research effort, and/or looking at specific possible shifts.
^ this taxonomy is not comprehensive, just things I came up with quickly. I might be missing something important.
To give a cop-out answer to your question: I feel like if I were making a long-timelines argument I’d argue that all 3 of those are ways of forecasting to give weight to, then aggregate. If I had to choose just one I’d probably still go with (1) though.
edit: oh there’s also the “defer to AI experts” argument. I mostly try not to think about deference-based arguments because thinking on the object-level is more productive, though I think if I were really trying to make an all-things-considered timelines distribution there’s some chance I would adjust to longer due to deference arguments (but also some chance I’d adjust toward shorter, given that lots of people who have thought deeply about AGI / are close to the action have short timelines).
There’s also “base rate of super crazy things happening is low” style arguments which I don’t give much weight to.
Thanks. I think this argument assumes that the main bottleneck to AI progress is something like research engineering speed, such that accelerating research engineering speed would drastically increase AI progress?
I think that that makes sense as long as we are talking about domains like games / math / programming where you can automatically verify the results, but that something like speed of real-world interaction becomes the bottleneck once shifting to more open domains.
Consider an AI being trained on a task such as “acting as the CEO for a startup”. There may not be a way to do this training other than to have it actually run a real startup, and then wait for several years to see how the results turn out. Even after several years, it will be hard to say exactly which parts of the decision process contributed, and how much of the startup’s success or failure was due to random factors. Furthermore, during this process the AI will need to be closely monitored in order to make sure that it does not do anything illegal or grossly immoral, slowing down its decision process and thus the whole training. And I haven’t even mentioned the expense of a training run where running just a single trial requires a startup-level investment (assuming that the startup won’t pay back its investment, of course).
Of course, humans do not learn to be CEOs by running a million companies and then getting a reward signal at the end. Human CEOs come in with a number of skills that they have already learned from somewhere else that they then apply to the context of running a company, shifting between their existing skills and applying them as needed. However, the question of what kind of approach and skill to apply in what situation, and how to prioritize between different approaches, is by itself a skillset that needs to be learned… quite possibly through a lot of real-world feedback.
Here’s the structure of the argument that I am most compelled by (I call it the benchmarks + gaps argument), I’m uncertain about the details.
Focus on the endpoint of substantially speeding up AI R&D / automating research engineering. Let’s define our timelines endpoint as something that ~5xs the rate of AI R&D algorithmic progress (compared to a counterfactual world with no post-2024 AIs). Then make an argument that ~fully automating research engineering (experiment implementation/monitoring) would do this, along with research taste of at least the 50th percentile AGI company researcher (experiment ideation/selection).
Focus on RE-Bench, since it’s the most relevant benchmark. For simplicity I’ll focus only on this, though for robustness more benchmarks should be considered.
Based on trend extrapolation and benchmark base rates, there’s roughly a 50% chance we’ll saturate RE-Bench by end of 2025.
Identify the most important gaps between saturating RE-Bench and the endpoint defined in (1). The most important gaps between saturating RE-Bench and achieving the 5xing of AI R&D algorithmic progress are: (a) time horizon as measured by human time spent, (b) tasks with worse feedback loops, (c) tasks with large codebases, and (d) becoming significantly cheaper and/or faster than humans. There are more, but my best guess is that these four are the most important; we should also take unknown gaps into account.
When forecasting the time to cross the gaps, it seems quite plausible that we get to the substantial AI R&D speedup within a few years after saturating RE-Bench, so by end of 2028 (and significantly earlier doesn’t seem crazy).
This is the most important part of the argument, and one that I have lots of uncertainty over. We have some data regarding the “crossing speed” of some of the gaps but the data are quite limited at the moment. So there are a lot of judgment calls needed and people with strong long timelines intuitions might think the remaining gaps will take a long time to cross without this being close to falsified by our data.
This is broken down into “time to cross the gaps at 2024 pace of progress” → adjusting based on compute forecasts and intermediate AI R&D speedups before reaching 5x.
From substantial AI R&D speedup to AGI. Once we have the 5xing AIs, that’s potentially already AGI by some definitions; if you have a stronger definition, the possibility of a somewhat fast takeoff means you might get it within a year or so after.
One reason I like this argument is that it will get much stronger over time as we get more difficult benchmarks and otherwise get more data about how quickly the gaps are being crossed.
I have a longer draft which makes this argument but it’s quite messy and incomplete and might not add much on top of the above summary for now. Unfortunately I’m prioritizing other workstreams over finishing this at the moment. DM me if you’d really like a link to the messy draft.
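To make the trend-extrapolation step (3) concrete, here is a minimal sketch of the kind of calculation involved: fit a line to benchmark scores over time and solve for when the trend hits a saturation threshold. This is an illustration only, not the author’s actual method; every number below is invented, and a real forecast would incorporate uncertainty over functional forms and benchmark base rates.

```python
# Hypothetical (year, benchmark score) data points -- all made up.
data = [(2023.0, 0.20), (2023.5, 0.35), (2024.0, 0.50), (2024.5, 0.65)]

xs = [x for x, _ in data]
ys = [y for _, y in data]
n = len(data)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least-squares fit: score = slope * year + intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in data) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

saturation = 0.95  # hypothetical score at which we'd call the benchmark saturated
saturation_year = (saturation - intercept) / slope
print(round(saturation_year, 2))  # ~2025.5 on this made-up data
```

The point of the sketch is only that a saturation date falls out mechanically once you commit to a trend; the judgment calls are in choosing the functional form and the data.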
RE-bench tasks (see page 7 here) are not the kind of AI research where you’re developing new AI paradigms and concepts. The tasks are much more straightforward than that. So your argument is basically assuming without argument that we can get to AGI with just the more straightforward stuff, as opposed to new AI paradigms and concepts.
If we do need new AI paradigms and concepts to get to AGI, then there would be a chicken-and-egg problem in automating AI research. Or more specifically, there would be two categories of AI R&D, with the less important R&D category (e.g. performance optimization and other REbench-type tasks) being automatable by near-future AIs, and the more important R&D category (developing new AI paradigms and concepts) not being automatable.
(Obviously you’re entitled to argue / believe that we don’t need new AI paradigms and concepts to get to AGI! It’s a topic where I think reasonable people disagree. I’m just suggesting that it’s a necessary assumption for your argument to hang together, right?)
I disagree. I think the existing body of published computer science and neuroscience research is chock full of loose threads: tons of potential innovations just waiting to be harvested by automated researchers. I’ve mentioned this idea elsewhere; I call it an ‘innovation overhang’. Simply testing interpolations and extrapolations (e.g. scaling up old forgotten ideas on modern hardware) seems highly likely to reveal plenty of successful new concepts, even if the hit rate per attempt is low. I think this means a better benchmark would consist of: taking two existing papers, finding a plausible hypothesis which combines the assumptions from the papers, designing, coding, and running tests, then reporting on the results.
So I don’t think “no new concepts” is a necessary assumption for getting to AGI quickly with the help of automated researchers.
Is this bottlenecked by programmer time or by compute cost?
Both? If you increase only one of the two the other becomes the bottleneck?
I agree this means that devoting substantial compute both to inference and to running experiments designed by AI researchers is a large cost. Presumably, as the competence of the AI researchers gets higher, it feels easier to trust them not to waste their assigned experiment compute.
In Dwarkesh Patel’s interview with researcher friends, there was mention that AI researchers are already restricted by the compute granted to them for experiments, and probably also by the work hours per week they are allowed to spend on novel “off the main path” research.
So in order for there to be a big surge in AI R&D there’d need to be prioritization of that at a high level. This would be a change of direction from focusing primarily on scaling current techniques rapidly, and putting out slightly better products ASAP.
So yes, if you think that this priority shift won’t happen, then you should doubt that the increase in R&D speed my model predicts will occur.
But what would that world look like? Probably a world where scaling continues to pay dividends, and getting to AGI is more straightforward than Steve Byrnes or I expect.
I agree that that’s a substantial probability, but it’s also an AGI-soon sort of world.
I argue that for AGI to be not-soon, you need both scaling to fail and algorithm research to fail.
My impression based on talking to people at labs plus stuff I’ve read is that
Most AI researchers have no trouble coming up with useful ways of spending all of the compute available to them
Most of the expense of hiring AI researchers is compute costs for their experiments rather than salary
The big scaling labs try their best to hire the very best people they can get their hands on and concentrate their resources heavily into just a few teams, rather than trying to hire everyone with a pulse who can rub two tensors together.
(Very open to correction by people closer to the big scaling labs).
My model, then, says that compute availability is a constraint that binds much harder than programming or research ability, at least as things stand right now.
Sounds plausible to me. Especially since benchmarks encourage a focus on ability to hit the target at all rather than ability to either succeed or fail cheaply, which is what’s important in domains where the salary / electric bill of the experiment designer is an insignificant fraction of the total cost of the experiment.
Yeah, I expect it’s a matter of “dumb” scaling plus experimentation rather than any major new insights being needed. If scaling hits a wall that training on generated data + fine tuning + routing + specialization can’t overcome, I do agree that innovation becomes more important than iteration.
My model is not just “AGI-soon” but “the more permissive thresholds for when something should be considered AGI have already been met, and more such thresholds will fall in short order, and so we should stop asking when we will get AGI and start asking about when we will see each of the phenomena that we are using AGI as a proxy for”.
I think you’re mostly correct about current AI researchers being able to usefully experiment with all the compute they have available.
I do think there are some considerations here though.
How closely are they adhering to the “main path” of scaling existing techniques with minor tweaks? If you want to know how a minor tweak affects your current large model at scale, that is a very compute-heavy researcher-time-light type of experiment. On the other hand, if you want to test a lot of novel new paths at much smaller scales, then you are in a relatively compute-light but researcher-time-heavy regime.
What fraction of the available compute resources is the company assigning to each of training/inference/experiments? My guess is that the current split is somewhere around 63/33/4. If this were true, and the company decided to pivot away from training to focus on experiments (0/33/67), this would be something like a 16x increase in compute for experiments. So maybe that changes the bottleneck?
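A quick check of that arithmetic (keeping in mind that the 63/33/4 split is the commenter’s guess, not a known figure):

```python
# Hypothetical percentages of total compute -- the commenter's guess, not data.
train, inference, experiments = 63, 33, 4

# Pivot scenario from the comment: reassign training compute to experiments (0/33/67).
new_experiments = train + experiments  # 67
increase = new_experiments / experiments
print(round(increase, 2))  # 16.75, i.e. "something like a 16x increase"
```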
We do indeed seem to be at “AGI for most stuff”, but with a spikey envelope of capability that leaves some dramatic failure modes. So it does make more sense to ask something like, “For remaining specific weakness X, what will the research agenda and timeline look like?”
This makes more sense than continuing to ask the vague “AGI complete” question when we are most of the way there already.
It sounds like your disagreement isn’t with drawing a link from RE-bench to (forecasts for) automating research engineering, but is instead with thinking that you can get AGI shortly after automating research engineering due to AI R&D acceleration and already being pretty close. Is that right?
Note that the comment says research engineering, not research scientists.
For context, in a sibling comment Ryan said (and Steven agreed with):
Now responding on whether I think the no new paradigms assumption is needed:
I generally have not been thinking in these sorts of binary terms but instead thinking in terms more like “Algorithmic progress research is moving at pace X today, if we had automated research engineers it would be sped up to N*X.” I’m not necessarily taking a stand on whether the progress will involve new paradigms or not, so I don’t think it requires an assumption of no new paradigms.
However:
If you think almost all new progress in some important sense will come from paradigm shifts, the forecasting method becomes weaker because the incremental progress doesn’t say as much about progress toward automated research engineering or AGI.
You might think that it’s more confusing than clarifying to think in terms of collapsing all research progress into a single “speed” and forecasting based on that.
Requiring a paradigm shift might lead to placing less weight on lower amounts of research effort being required; and even if the probability distribution is the same, what we should expect to see in the world leading up to AGI is not.
I’d also add that:
Regarding what research tasks I’m forecasting for the automated research engineer: RE-Bench is not supposed to fully represent the tasks involved in actual research engineering. That’s why we have the gaps.
Regarding to what extent having an automated research engineer would speed up progress in worlds in which we need a paradigm shift: I think it’s hard to separate out conceptual from engineering/empirical work in terms of progress toward new paradigms. My guess would be being able to implement experiments very cheaply would substantially increase the expected number of paradigm shifts per unit time.
Thanks for this thoughtful reply!
In the framework of the argument, you seem to be objecting to premises 4-6. Specifically you seem to be saying “There’s another important gap between RE-bench saturation and completely automating AI R&D: new-paradigm-and-concept-generation. Perhaps we can speed up AI R&D by 5x or so without crossing this gap, simply by automating engineering, but to get to AGI we’ll need to cross this gap, and this gap might take a long time to cross even at 5x speed.”
(Is this a fair summary?)
If that’s what you are saying, I think I’d reply:
We already have a list of potential gaps, and this one seems to be a mediocre addition to the list IMO. I don’t think this distinction between old-paradigm/old-concepts and new-paradigm/new-concepts is going to hold up very well to philosophical inspection or continued ML progress; it smells similar to ye olde “do LLMs truly understand, or are they merely stochastic parrots?” and “Can they extrapolate, or do they merely interpolate?”
That said, I do think it’s worthy of being included on the list. I’m just not as excited about it as the other entries, especially (a) and (b).
I’d also say: What makes you think that this gap will take years to cross even at 5x speed? (i.e. even when algorithmic progress is 5x faster than it has been for the past decade) Do you have a positive argument, or is it just generic uncertainty / absence-of-evidence?
(For context: I work in the same org as Eli and basically agree with his argument above)
I think I’m objecting to (as Eli wrote) “collapsing all [AI] research progress into a single “speed” and forecasting based on that”. There can be different types of AI R&D, and we might be able to speed up some types without speeding up other types. For example, coming up with the AlphaGo paradigm (self-play, MCTS, ConvNets, etc.) or LLM paradigm (self-supervised pretraining, Transformers, etc.) is more foundational, whereas efficiently implementing and debugging a plan is less foundational. (Kinda “science vs engineering”?) I also sometimes use the example of Judea Pearl coming up with the belief prop algorithm in 1982. If everyone had tons of compute and automated research engineer assistants, would we have gotten belief prop earlier? I’m skeptical. As far as I understand: Belief prop was not waiting on compute. You can do belief prop on a 1960s mainframe. Heck, you can do belief prop on an abacus. Social scientists have been collecting data since the 1800s, and I imagine that belief prop would have been useful for analyzing at least some of that data, if only someone had invented it.
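Since belief propagation is doing a lot of work in this example, here is a minimal sketch of the sum-product algorithm it refers to, run on a tiny 3-node chain of binary variables and checked against brute-force enumeration. All the potentials are invented for illustration; the point is just that the algorithm needs essentially no compute, consistent with the claim that it was not waiting on hardware.

```python
import itertools

# Unary potentials phi[i][x] for each node, and pairwise potentials
# psi[i][x][y] for each edge (i, i+1). All values invented.
phi = [
    [1.0, 2.0],  # node 0
    [1.0, 1.0],  # node 1
    [3.0, 1.0],  # node 2
]
psi = [
    [[2.0, 1.0], [1.0, 2.0]],  # edge (0, 1)
    [[2.0, 1.0], [1.0, 2.0]],  # edge (1, 2)
]

def bp_marginals(phi, psi):
    """Exact marginals on a chain via forward/backward sum-product messages."""
    n = len(phi)
    fwd = [[1.0, 1.0] for _ in range(n)]  # message arriving from the left
    for i in range(1, n):
        for x in range(2):
            fwd[i][x] = sum(phi[i-1][y] * fwd[i-1][y] * psi[i-1][y][x]
                            for y in range(2))
    bwd = [[1.0, 1.0] for _ in range(n)]  # message arriving from the right
    for i in range(n - 2, -1, -1):
        for x in range(2):
            bwd[i][x] = sum(phi[i+1][y] * bwd[i+1][y] * psi[i][x][y]
                            for y in range(2))
    marginals = []
    for i in range(n):
        unnorm = [phi[i][x] * fwd[i][x] * bwd[i][x] for x in range(2)]
        z = sum(unnorm)
        marginals.append([u / z for u in unnorm])
    return marginals

def brute_force_marginals(phi, psi):
    """Same marginals by summing over all 2**n joint assignments."""
    n = len(phi)
    marg = [[0.0, 0.0] for _ in range(n)]
    for assign in itertools.product(range(2), repeat=n):
        p = 1.0
        for i in range(n):
            p *= phi[i][assign[i]]
        for i in range(n - 1):
            p *= psi[i][assign[i]][assign[i + 1]]
        for i in range(n):
            marg[i][assign[i]] += p
    return [[v / sum(row) for v in row] for row in marg]

bp = bp_marginals(phi, psi)
bf = brute_force_marginals(phi, psi)
assert all(abs(a - b) < 1e-9
           for pa, pb in zip(bp, bf) for a, b in zip(pa, pb))
print(bp[0])  # node-0 marginal, agrees with brute force
```

On a chain this is just forward-backward message passing; Pearl’s contribution was the general tree-structured version, and the later generalization ran the same updates on graphs with loops.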
Indeed. Not only could belief prop have been invented in 1960, it was invented around 1960 (published 1962, “Low density parity check codes”, IRE Transactions on Information Theory) by Robert Gallager, as a decoding algorithm for error correcting codes.
I recognized that Gallager’s method was the same as Pearl’s belief propagation in 1996 (MacKay and Neal, “Near Shannon limit performance of low density parity check codes”, Electronics Letters, vol. 33, pp. 457-458).
This says something about the ability of AI to potentially speed up research by simply linking known ideas (even if it’s not really AGI).
Came here to say this, got beaten to it by Radford Neal himself, wow! Well, I’m gonna comment anyway, even though it’s mostly been said.
Gallager proposed belief propagation as an approximate good-enough method of decoding a certain error-correcting code, but didn’t notice that it worked on all sorts of probability problems. Pearl proposed it as a general mechanism for dealing with probability problems, but wanted perfect mathematical correctness, so confined himself to tree-shaped problems. It was their common generalization that was the real breakthrough: an approximate good-enough solution to all sorts of problems. Which is what Pearl eventually noticed, so props to him.
If we’d had AGI in the 1960s, someone with a probability problem could have said “Here’s my problem. For every paper in the literature, spawn an instance to read that paper and tell me if it has any help for my problem.” It would have found Gallager’s paper and said “Maybe you could use this?”
I just wanted to add that this hypothesis, i.e.
…is parallel to what we see in other kinds of automation.
The technology of today has been much better at automating the production of clocks than the production of haircuts. Thus, 2024 technology is great at automating the production of some physical things but only slightly helpful for automating the production of some other physical things.
By the same token, different AI R&D projects are trying to “produce” different types of IP. Thus, it’s similarly possible that 2029 AI technology will be great at automating the production of some types of AI-related IP but only slightly helpful for automating the production of some other types of AI-related IP.
I disagree that there is a difference of kind between “engineering ingenuity” and “scientific discovery”, at least in the business of AI. The examples you give—self-play, MCTS, ConvNets—were all used in game-playing programs before AlphaGo. The trick of AlphaGo was to combine them, and then discover that it worked astonishingly well. It was very clever and tasteful engineering to combine them, but only a breakthrough in retrospect. And the people that developed them each earlier, for their independent purposes? They were part of the ordinary cycle of engineering development: “Look at a problem, think as hard as you can, come up with something, try it, publish the results.” They’re just the ones you remember, because they were good.
Paradigm shifts do happen, but I don’t think we need them between here and AGI.
Yeah I’m definitely describing something as a binary when it’s really a spectrum. (I was oversimplifying since I didn’t think it mattered for that particular context.)
In the context of AI, I don’t know what the difference is (if any) between engineering and science. You’re right that I was off-base there…
…But I do think that there’s a spectrum from ingenuity / insight to grunt-work.
So I’m bringing up a possible scenario where near-future AI gets progressively less useful as you move towards the ingenuity side of that spectrum, and where changing that situation (i.e., automating ingenuity) itself requires a lot of ingenuity, posing a chicken-and-egg problem / bottleneck that limits the scope of rapid near-future recursive AI progress.
Perhaps! Time will tell :)
I certainly agree that the collapse is a lossy abstraction / simplifies; in reality some domains of research will speed up more than 5x and others less than 5x, for example, even if we did get automated research engineers dropped on our heads tomorrow. Are you therefore arguing that in particular, the research needed to get to AGI is of the kind that won’t be sped up significantly? What’s the argument—that we need a new paradigm to get to AIs that can generate new paradigms, and being able to code really fast and well won’t majorly help us think of new paradigms? (I’d disagree with both sub-claims of that claim)
Yup! Although I’d say I’m “bringing up a possibility” rather than “arguing” in this particular thread. And I guess it depends on where we draw the line between “majorly” and “minorly” :)
This is clarifying for me, appreciate it. If I believed (a) that we needed a paradigm shift like the ones to LLMs in order to get AI systems resulting in substantial AI R&D speedup, and (b) that trend extrapolation from benchmark data would not be informative for predicting these paradigm shifts, then I would agree that the benchmarks + gaps method is not particularly informative.
Do you think that’s a fair summary of (this particular set of) necessary conditions?
(edit: didn’t see @Daniel Kokotajlo’s new comment before mine. I agree with him regarding disagreeing with both sub-claims but I think I have a sense of where you’re coming from.)
I find this kind of pattern-match pretty unconvincing without more object-level explanation. Why exactly do you think this distinction isn’t important? (I’m also not sure “Can they extrapolate, or do they merely interpolate?” qualifies as “ye olde,” still seems like a good question to me at least w.r.t. sufficiently out-of-distribution extrapolation.)
We are at an impasse then; I think basically I’m just the mirror of you. To me, the burden is on whoever thinks the distinction is important to explain why it matters. Current LLMs do many amazing things that many people—including AI experts—thought LLMs could never do due to architectural limitations. Recent history is full of examples of AI experts saying “LLMs are the offramp to AGI; they cannot do X; we need new paradigm to do X” and then a year or two later LLMs are doing X. So now I’m skeptical and would ask questions like: “Can you say more about this distinction—is it a binary, or a dimension? If it’s a dimension, how can we measure progress along it, and are we sure there hasn’t been significant progress on it already in the last few years, within the current paradigm? If there has indeed been no significant progress (as with ARC-AGI until 2024) is there another explanation for why that might be, besides your favored one (that your distinction is super important and that because of it a new paradigm is needed to get to AGI)”
The burden is on you because you’re saying “we have gone from not having the core algorithms for intelligence in our computers, to yes having them”.
https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce#The__no_blockers__intuition
And I think you’re admitting that your argument is “if we mush all capabilities together into one dimension, AI is moving up on that one dimension, so things will keep going up”.
Would you say the same thing about the invention of search engines? That was a huge jump in the capability of our computers. And it looks even more impressive if you blur out your vision—pretend you don’t know that the text that comes up on your screen is written by a human, and pretend you don’t know that search is a specific kind of task distinct from a lot of other activity that would be involved in “True Understanding, woooo”—and just say “wow! previously our computers couldn’t write a poem, but now with just a few keystrokes my computer can literally produce Billy Collins level poetry!”.
Blurring things together at that level works for, like, macroeconomic trends. But if you look at macroeconomic trends it doesn’t say singularity in 2 years! Going to 2 or 10 years is an inside-view thing to conclude! You’re making some inference like “there’s an engine that is very likely operating here, that takes us to AGI in xyz years”.
I’m not saying that. You are the one who introduced the concept of “the core algorithms for intelligence;” you should explain what that means and why it’s a binary (or, if it’s not a binary but rather a dimension, why we haven’t been moving along that dimension in the recent past).
ETA: I do have an ontology, a way of thinking about these things, that is more sophisticated than simply mushing all capabilities together into one dimension. I just don’t accept your ontology yet.
(I might misunderstand you. My impression was that you’re saying it’s valid to extrapolate from “model XYZ does well at RE-Bench” to “model XYZ does well at developing new paradigms and concepts.” But maybe you’re saying that the trend of LLM success at various things suggests we don’t need new paradigms and concepts to get AGI in the first place? My reply below assumes the former:)
I’m not saying LLMs can’t develop new paradigms and concepts, though. The original claim you were responding to was that success at RE-Bench in particular doesn’t tell us much about success at developing new paradigms and concepts. “LLMs have done various things some people didn’t expect them to be able to do” doesn’t strike me as much of an argument against that.
More broadly, re: your burden of proof claim, I don’t buy that “LLMs have done various things some people didn’t expect them to be able to do” determinately pins down an extrapolation to “the current paradigm(s) will suffice for AGI, within 2-3 years.” That’s not a privileged reference class forecast, it’s a fairly specific prediction.
I feel like this sub-thread is going in circles; perhaps we should go back to the start of it. I said:
You replied:
Now, elsewhere in this comment section, various people (Carl, Radford) have jumped in to say the sorts of object-level things I also would have said if I were going to get into it. E.g. that old vs. new paradigm isn’t a binary but a spectrum, that automating research engineering WOULD actually speed up new-paradigm discovery, etc. What do you think of the points they made?
Also, I’m still waiting to hear answers to these questions: “Can you say more about this distinction—is it a binary, or a dimension? If it’s a dimension, how can we measure progress along it, and are we sure there hasn’t been significant progress on it already in the last few years, within the current paradigm? If there has indeed been no significant progress (as with ARC-AGI until 2024) is there another explanation for why that might be, besides your favored one (that your distinction is super important and that because of it a new paradigm is needed to get to AGI)”
@elifland what do you think is the strongest argument for long(er) timelines? Do you think it’s essentially just “it takes a long time for researchers learn how to cross the gaps”?
Or do you think there’s an entirely different frame (something that’s in an ontology that just looks very different from the one presented in the “benchmarks + gaps argument”?)
A few possible categories of situations we might have long timelines, off the top of my head:
Benchmarks + gaps is still best: the overall gap is somewhat larger + there’s a slowdown in compute doubling time after 2028, but trend extrapolations still tell us something about gap trends. This is how I would most naturally think about timelines through maybe the 2030s, and potentially beyond if neither of the next two holds.
Others are best (more than one of these can be true):
The current benchmarks and evaluations are so far away from AGI that trends on them don’t tell us anything (including regarding how fast gaps might be crossed). In this case one might want to identify the 1-2 most important gaps and reason about when we will cross these based on gears-level reasoning or trend extrapolation/forecasting on “real-world” data (e.g. revenue?) rather than trend extrapolation on benchmarks. Example candidate “gaps” that I often hear for these sorts of cases are the lack of feedback loops and the “long-tail of tasks” / reliability.
A paradigm shift in AGI training is needed and benchmark trends don’t tell us much about when we will achieve this (this is basically Steven’s sibling comment): in this case the best analysis might involve looking at the base rate of paradigm shifts per research effort, and/or looking at specific possible shifts.
^ this taxonomy is not comprehensive, just things I came up with quickly. I might be missing something important.
To give a cop-out answer to your question: I feel like if I were making a long-timelines argument I’d argue that all 3 of those would be ways of forecasting to give weight to, then aggregate. If I had to choose just one I’d probably still go with (1) though.
edit: oh there’s also the “defer to AI experts” argument. I mostly try not to think about deference-based arguments because thinking on the object-level is more productive, though I think if I were really trying to make an all-things-considered timelines distribution there’s some chance I would adjust to longer due to deference arguments (but also some chance I’d adjust toward shorter, given that lots of people who have thought deeply about AGI / are close to the action have short timelines).
There’s also “base rate of super crazy things happening is low” style arguments which I don’t give much weight to.
Thanks. I think this argument assumes that the main bottleneck to AI progress is something like research engineering speed, such that accelerating research engineering speed would drastically increase AI progress?
I think that that makes sense as long as we are talking about domains like games / math / programming where you can automatically verify the results, but that something like speed of real-world interaction becomes the bottleneck once shifting to more open domains.
Consider an AI being trained on a task such as “acting as the CEO for a startup”. There may not be a way to do this training other than to have it actually run a real startup, and then wait for several years to see how the results turn out. Even after several years, it will be hard to say exactly which parts of the decision process contributed, and how much of the startup’s success or failure was due to random factors. Furthermore, during this process the AI will need to be closely monitored in order to make sure that it does not do anything illegal or grossly immoral, slowing down its decision process and thus the whole training. And I haven’t even mentioned the expense of a training run where running just a single trial requires a startup-level investment (assuming that the startup won’t pay back its investment, of course).
Of course, humans do not learn to be CEOs by running a million companies and then getting a reward signal at the end. Human CEOs come in with a number of skills that they have already learned from somewhere else that they then apply to the context of running a company, shifting between their existing skills and applying them as needed. However, the question of what kind of approach and skill to apply in what situation, and how to prioritize between different approaches, is by itself a skillset that needs to be learned… quite possibly through a lot of real-world feedback.