Here’s the structure of the argument that I find most compelling (I call it the benchmarks + gaps argument), though I’m uncertain about the details.
Focus on the endpoint of substantially speeding up AI R&D / automating research engineering. Let’s define our timelines endpoint as something that ~5xs the rate of AI R&D algorithmic progress (compared to a counterfactual world with no post-2024 AIs). Then make an argument that ~fully automating research engineering (experiment implementation/monitoring) would do this, along with research taste of at least the 50th percentile AGI company researcher (experiment ideation/selection).
Focus on RE-Bench, since it’s the most relevant benchmark. For simplicity I’ll focus only on RE-Bench, though for robustness more benchmarks should be considered.
Based on trend extrapolation and benchmark base rates, there’s roughly a 50% chance we’ll saturate RE-Bench by end of 2025.
Identify the most important gaps between saturating RE-Bench and the endpoint defined in (1). The most important gaps are: (a) time horizon as measured by human time spent, (b) tasks with worse feedback loops, (c) tasks with large codebases, and (d) becoming significantly cheaper and/or faster than humans. There are more, but my best guess is that these four are the most important; unknown gaps should also be taken into account.
When forecasting the time to cross the gaps, it seems quite plausible that we get to the substantial AI R&D speedup within a few years after saturating REBench, so by end of 2028 (and significantly earlier doesn’t seem crazy).
This is the most important part of the argument, and one that I have lots of uncertainty over. We have some data regarding the “crossing speed” of some of the gaps but the data are quite limited at the moment. So there are a lot of judgment calls needed and people with strong long timelines intuitions might think the remaining gaps will take a long time to cross without this being close to falsified by our data.
This is broken down into “time to cross the gaps at 2024 pace of progress” → adjusting based on compute forecasts and intermediate AI R&D speedups before reaching 5x (a toy numerical sketch of this adjustment is included just below).
From substantial AI R&D speedup to AGI. Once we have the 5xing AIs, that’s potentially already AGI by some definitions; if you have a stronger definition, the possibility of a somewhat fast takeoff means you might get it within a year or so after.
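To make the adjustment mentioned above (time at 2024 pace, adjusted for compute growth and intermediate speedups) concrete, here’s a toy numerical sketch. Every parameter (the size of the gap in “2024-pace years”, the compute growth rate, the shape of the intermediate speedup ramp) is a made-up assumption for illustration, not a value from the actual forecast:

```python
# Toy model: calendar time to cross the "gaps" after RE-Bench saturation.
# All numbers are illustrative assumptions, not the actual forecast.

def years_to_cross_gaps(
    gap_years_at_2024_pace=6.0,   # assumed: gaps need 6 "2024-pace" years of algorithmic progress
    compute_growth_per_year=1.35, # assumed: effective research compute grows ~35%/year
    max_speedup=5.0,              # the 5x AI R&D speedup endpoint
):
    progress = 0.0  # cumulative progress, in 2024-pace years
    t = 0.0
    dt = 0.1
    while progress < gap_years_at_2024_pace:
        frac_done = progress / gap_years_at_2024_pace
        # Assumed: intermediate AI speedup ramps linearly from 1x to 5x as the gaps close.
        ai_speedup = 1.0 + (max_speedup - 1.0) * frac_done
        compute_factor = compute_growth_per_year ** t
        progress += dt * ai_speedup * compute_factor
        t += dt
    return t

print(f"Calendar years after saturation: {years_to_cross_gaps():.1f}")
```

Under these invented inputs the gaps close in well under the naive six calendar years; the real analysis would use forecasted values and carry uncertainty over all of them.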
One reason I like this argument is that it will get much stronger over time as we get more difficult benchmarks and otherwise get more data about how quickly the gaps are being crossed.
I have a longer draft which makes this argument but it’s quite messy and incomplete and might not add much on top of the above summary for now. Unfortunately I’m prioritizing other workstreams over finishing this at the moment. DM me if you’d really like a link to the messy draft.
RE-bench tasks (see page 7 here) are not the kind of AI research where you’re developing new AI paradigms and concepts. The tasks are much more straightforward than that. So your argument is basically assuming without argument that we can get to AGI with just the more straightforward stuff, as opposed to new AI paradigms and concepts.
If we do need new AI paradigms and concepts to get to AGI, then there would be a chicken-and-egg problem in automating AI research. Or more specifically, there would be two categories of AI R&D, with the less important R&D category (e.g. performance optimization and other REbench-type tasks) being automatable by near-future AIs, and the more important R&D category (developing new AI paradigms and concepts) not being automatable.
(Obviously you’re entitled to argue / believe that we don’t need new AI paradigms and concepts to get to AGI! It’s a topic where I think reasonable people disagree. I’m just suggesting that it’s a necessary assumption for your argument to hang together, right?)
I disagree. I think the existing body of published computer science and neuroscience research is chock full of loose threads: tons of potential innovations just waiting to be harvested by automated researchers. I’ve mentioned this idea elsewhere; I call it an ‘innovation overhang’.
Simply testing interpolations and extrapolations (e.g. scaling up old forgotten ideas on modern hardware) seems highly likely to reveal plenty of successful new concepts, even if the hit rate per attempt is low.
I think this means a better benchmark would consist of: taking two existing papers, finding a plausible hypothesis which combines the assumptions from the papers, designing and coding and running tests, then reporting on the results.
So I don’t think “no new concepts” is a necessary assumption for getting to AGI quickly with the help of automated researchers.
Simply testing interpolations and extrapolations (e.g. scaling up old forgotten ideas on modern hardware) seems highly likely to reveal plenty of successful new concepts, even if the hit rate per attempt is low
Is this bottlenecked by programmer time or by compute cost?
Both? If you increase only one of the two the other becomes the bottleneck?
I agree this means that the decision to devote substantial compute both to inference and to running experiments designed by AI researchers is a large cost. Presumably, as the competence of the AI researchers gets higher, it becomes easier to trust them not to waste their assigned experiment compute.
There was discussion on Dwarkesh Patel’s interview with researcher friends, where it was mentioned that AI researchers are already restricted by the compute granted to them for experiments, and probably also by the work hours per week they are allowed to spend on novel “off the main path” research.
So in order for there to be a big surge in AI R&D there’d need to be prioritization of that at a high level. This would be a change of direction from focusing primarily on scaling current techniques rapidly, and putting out slightly better products ASAP.
So yes, if you think that this priority shift won’t happen, then you should doubt that the increase in R&D speed my model predicts will occur.
But what would that world look like? Probably a world where scaling continues to pay dividends, and getting to AGI is more straightforward than Steve Byrnes or I expect.
I agree that that’s a substantial probability, but it’s also an AGI-soon sort of world.
I argue that for AGI to be not-soon, you need both scaling and algorithmic research to fail.
Both? If you increase only one of the two the other becomes the bottleneck?
My impression based on talking to people at labs plus stuff I’ve read is that
Most AI researchers have no trouble coming up with useful ways of spending all of the compute available to them
Most of the expense of hiring AI researchers is the compute cost of their experiments rather than their salary
The big scaling labs try their best to hire the very best people they can get their hands on and concentrate their resources heavily into just a few teams, rather than trying to hire everyone with a pulse who can rub two tensors together.
(Very open to correction by people closer to the big scaling labs).
My model, then, says that compute availability is a constraint that binds much harder than programming or research ability, at least as things stand right now.
There was discussion on Dwarkesh Patel’s interview with researcher friends, where it was mentioned that AI researchers are already restricted by the compute granted to them for experiments, and probably also by the work hours per week they are allowed to spend on novel “off the main path” research.
Sounds plausible to me. Especially since benchmarks encourage a focus on ability to hit the target at all rather than ability to either succeed or fail cheaply, which is what’s important in domains where the salary / electric bill of the experiment designer is an insignificant fraction of the total cost of the experiment.
But what would that world look like? [...] I agree that that’s a substantial probability, but it’s also an AGI-soon sort of world.
Yeah, I expect it’s a matter of “dumb” scaling plus experimentation rather than any major new insights being needed. If scaling hits a wall that training on generated data + fine tuning + routing + specialization can’t overcome, I do agree that innovation becomes more important than iteration.
My model is not just “AGI-soon” but “the more permissive thresholds for when something should be considered AGI have already been met, and more such thresholds will fall in short order, and so we should stop asking when we will get AGI and start asking about when we will see each of the phenomena that we are using AGI as a proxy for”.
I think you’re mostly correct about current AI researchers being able to usefully experiment with all the compute they have available.
I do think there are some considerations here though.
How closely are they adhering to the “main path” of scaling existing techniques with minor tweaks? If you want to know how a minor tweak affects your current large model at scale, that is a very compute-heavy researcher-time-light type of experiment. On the other hand, if you want to test a lot of novel new paths at much smaller scales, then you are in a relatively compute-light but researcher-time-heavy regime.
What fraction of the available compute resources is the company assigning to each of training/inference/experiments? My guess is that the current split is somewhere around 63/33/4. If this was true, and the company decided to pivot away from training to focus on experiments (0/33/67), this would be something like a 16x increase in compute for experiments. So maybe that changes the bottleneck?
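As a quick sanity check on that last ratio (using my guessed 63/33/4 split, which is an assumption, not any lab’s reported numbers):

```python
# Guessed split of compute across training / inference / experiments (illustrative only).
current = {"training": 63, "inference": 33, "experiments": 4}
pivoted = {"training": 0, "inference": 33, "experiments": 67}

increase = pivoted["experiments"] / current["experiments"]
print(f"Experiment compute increase: ~{increase:.1f}x")  # ~16.8x, i.e. "something like 16x"
```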
We do indeed seem to be at “AGI for most stuff”, but with a spikey envelope of capability that leaves some dramatic failure modes. So it does make more sense to ask something like, “For remaining specific weakness X, what will the research agenda and timeline look like?”
This makes more sense than continuing to ask the vague “AGI complete” question when we are most of the way there already.
For context, in a sibling comment Ryan said (and Steven agreed with):
It sounds like your disagreement isn’t with drawing a link from RE-bench to (forecasts for) automating research engineering, but is instead with thinking that you can get AGI shortly after automating research engineering due to AI R&D acceleration and already being pretty close. Is that right?
Note that the comment says research engineering, not research scientists.
Now responding on whether I think the no new paradigms assumption is needed:
(Obviously you’re entitled to argue / believe that we don’t need new AI paradigms and concepts to get to AGI! It’s a topic where I think reasonable people disagree. I’m just suggesting that it’s a necessary assumption for your argument to hang together, right?)
I generally have not been thinking in these sorts of binary terms but instead thinking in terms more like “Algorithmic progress research is moving at pace X today, if we had automated research engineers it would be sped up to N*X.” I’m not necessarily taking a stand on whether the progress will involve new paradigms or not, so I don’t think it requires an assumption of no new paradigms.
However:
If you think almost all new progress in some important sense will come from paradigm shifts, the forecasting method becomes weaker because the incremental progress doesn’t say as much about progress toward automated research engineering or AGI.
You might think that it’s more confusing than clarifying to think in terms of collapsing all research progress into a single “speed” and forecasting based on that.
Requiring a paradigm shift might lead to placing less weight on lower amounts of research effort being required; and even if the probability distribution is the same, what we should expect to see in the world leading up to AGI is not.
I’d also add that:
Regarding what research tasks I’m forecasting for the automated research engineer: REBench is not supposed to fully represent the tasks involved in actual research engineering. That’s why we have the gaps.
Regarding to what extent having an automated research engineer would speed up progress in worlds in which we need a paradigm shift: I think it’s hard to separate out conceptual from engineering/empirical work in terms of progress toward new paradigms. My guess would be that being able to implement experiments very cheaply would substantially increase the expected number of paradigm shifts per unit time.
It sounds like your disagreement isn’t with drawing a link from RE-bench to (forecasts for) automating research engineering, but is instead with thinking that you can get AGI shortly after automating research engineering due to AI R&D acceleration and already being pretty close. Is that right?
Note that the comment says research engineering, not research scientists.
Thanks for this thoughtful reply!

In the framework of the argument, you seem to be objecting to premises 4-6. Specifically you seem to be saying “There’s another important gap between RE-bench saturation and completely automating AI R&D: new-paradigm-and-concept-generation. Perhaps we can speed up AI R&D by 5x or so without crossing this gap, simply by automating engineering, but to get to AGI we’ll need to cross this gap, and this gap might take a long time to cross even at 5x speed.”
(Is this a fair summary?)
If that’s what you are saying, I think I’d reply:
We already have a list of potential gaps, and this one seems to be a mediocre addition to the list IMO. I don’t think this distinction between old-paradigm/old-concepts and new-paradigm/new-concepts is going to hold up very well to philosophical inspection or continued ML progress; it smells similar to ye olde “do LLMs truly understand, or are they merely stochastic parrots?” and “Can they extrapolate, or do they merely interpolate?”
That said, I do think it’s worthy of being included on the list. I’m just not as excited about it as the other entries, especially (a) and (b).
I’d also say: What makes you think that this gap will take years to cross even at 5x speed? (i.e. even when algorithmic progress is 5x faster than it has been for the past decade) Do you have a positive argument, or is it just generic uncertainty / absence-of-evidence?
(For context: I work in the same org as Eli and basically agree with his argument above)
I think I’m objecting to (as Eli wrote) “collapsing all [AI] research progress into a single “speed” and forecasting based on that”. There can be different types of AI R&D, and we might be able to speed up some types without speeding up other types. For example, coming up with the AlphaGo paradigm (self-play, MCTS, ConvNets, etc.) or LLM paradigm (self-supervised pretraining, Transformers, etc.) is more foundational, whereas efficiently implementing and debugging a plan is less foundational. (Kinda “science vs engineering”?) I also sometimes use the example of Judea Pearl coming up with the belief prop algorithm in 1982. If everyone had tons of compute and automated research engineer assistants, would we have gotten belief prop earlier? I’m skeptical. As far as I understand: Belief prop was not waiting on compute. You can do belief prop on a 1960s mainframe. Heck, you can do belief prop on an abacus. Social scientists have been collecting data since the 1800s, and I imagine that belief prop would have been useful for analyzing at least some of that data, if only someone had invented it.
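To make the “abacus” point concrete, here’s a minimal sum-product belief propagation pass on a three-variable chain (the potentials are toy numbers I’m making up; the point is just that exact inference here is a handful of small matrix-vector products, well within reach of a 1960s machine):

```python
import numpy as np

# Minimal sum-product belief propagation on a 3-node chain of binary variables.
# Pairwise potentials (toy numbers): psi[i, j] = compatibility of (x_k = i, x_{k+1} = j).
psi12 = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
psi23 = np.array([[1.5, 1.0],
                  [1.0, 0.8]])
unary = [np.array([1.0, 1.0]),   # flat prior on x1
         np.array([1.0, 1.0]),   # flat prior on x2
         np.array([2.0, 1.0])]   # evidence favouring x3 = 0

# Forward and backward messages along the chain.
m_1_to_2 = psi12.T @ unary[0]                 # sum over x1
m_2_to_3 = psi23.T @ (unary[1] * m_1_to_2)    # sum over x2
m_3_to_2 = psi23 @ unary[2]                   # sum over x3
m_2_to_1 = psi12 @ (unary[1] * m_3_to_2)      # sum over x2

def normalize(v):
    return v / v.sum()

# Beliefs = local evidence times incoming messages, normalized: exact marginals on a tree.
b1 = normalize(unary[0] * m_2_to_1)
b2 = normalize(unary[1] * m_1_to_2 * m_3_to_2)
b3 = normalize(unary[2] * m_2_to_3)
print("Marginals:", b1, b2, b3)
```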
Indeed. Not only could belief prop have been invented in 1960, it was invented around 1960 (published 1962, “Low density parity check codes”, IRE Transactions on Information Theory) by Robert Gallager, as a decoding algorithm for error correcting codes.
I recognized that Gallager’s method was the same as Pearl’s belief propagation in 1996 (MacKay and Neal, “Near Shannon limit performance of low density parity check codes”, Electronics Letters, vol. 33, pp. 457-458).
This says something about the ability of AI to potentially speed up research by simply linking known ideas (even if it’s not really AGI).
Came here to say this, got beaten to it by Radford Neal himself, wow! Well, I’m gonna comment anyway, even though it’s mostly been said.
Gallager proposed belief propagation as an approximate good-enough method of decoding a certain error-correcting code, but didn’t notice that it worked on all sorts of probability problems. Pearl proposed it as a general mechanism for dealing with probability problems, but wanted perfect mathematical correctness, so confined himself to tree-shaped problems. It was their common generalization that was the real breakthrough: an approximate good-enough solution to all sorts of problems. Which is what Pearl eventually noticed, so props to him.
If we’d had AGI in the 1960s, someone with a probability problem could have said “Here’s my problem. For every paper in the literature, spawn an instance to read that paper and tell me if it has any help for my problem.” It would have found Gallager’s paper and said “Maybe you could use this?”
I just wanted to add that this hypothesis, i.e.

I think I’m objecting to (as Eli wrote) “collapsing all [AI] research progress into a single “speed” and forecasting based on that”. There can be different types of AI R&D, and we might be able to speed up some types without speeding up other types.
…is parallel to what we see in other kinds of automation.
The technology of today has been much better at automating the production of clocks than the production of haircuts. Thus, 2024 technology is great at automating the production of some physical things but only slightly helpful for automating the production of some other physical things.
By the same token, different AI R&D projects are trying to “produce” different types of IP. Thus, it’s similarly possible that 2029 AI technology will be great at automating the production of some types of AI-related IP but only slightly helpful for automating the production of some other types of AI-related IP.
I disagree that there is a difference of kind between “engineering ingenuity” and “scientific discovery”, at least in the business of AI. The examples you give—self-play, MCTS, ConvNets—were all used in game-playing programs before AlphaGo. The trick of AlphaGo was to combine them, and then discover that it worked astonishingly well. It was very clever and tasteful engineering to combine them, but only a breakthrough in retrospect. And the people that developed them each earlier, for their independent purposes? They were part of the ordinary cycle of engineering development: “Look at a problem, think as hard as you can, come up with something, try it, publish the results.” They’re just the ones you remember, because they were good.
Paradigm shifts do happen, but I don’t think we need them between here and AGI.
Yeah I’m definitely describing something as a binary when it’s really a spectrum. (I was oversimplifying since I didn’t think it mattered for that particular context.)
In the context of AI, I don’t know what the difference is (if any) between engineering and science. You’re right that I was off-base there…
…But I do think that there’s a spectrum from ingenuity / insight to grunt-work.
So I’m bringing up a possible scenario where near-future AI gets progressively less useful as you move towards the ingenuity side of that spectrum, and where changing that situation (i.e., automating ingenuity) itself requires a lot of ingenuity, posing a chicken-and-egg problem / bottleneck that limits the scope of rapid near-future recursive AI progress.
Paradigm shifts do happen, but I don’t think we need them between here and AGI.
I certainly agree that the collapse is a lossy abstraction / simplifies; in reality some domains of research will speed up more than 5x and others less than 5x, for example, even if we did get automated research engineers dropped on our heads tomorrow. Are you therefore arguing that in particular, the research needed to get to AGI is of the kind that won’t be sped up significantly? What’s the argument—that we need a new paradigm to get to AIs that can generate new paradigms, and being able to code really fast and well won’t majorly help us think of new paradigms? (I’d disagree with both sub-claims of that claim)
Are you therefore arguing that in particular, the research needed to get to AGI is of the kind that won’t be sped up significantly? What’s the argument—that we need a new paradigm to get to AIs that can generate new paradigms, and being able to code really fast and well won’t majorly help us think of new paradigms? (I’d disagree with both sub-claims of that claim)
Yup! Although I’d say I’m “bringing up a possibility” rather than “arguing” in this particular thread. And I guess it depends on where we draw the line between “majorly” and “minorly” :)
This is clarifying for me, appreciate it. If I believed (a) that we needed a paradigm shift like the ones to LLMs in order to get AI systems resulting in substantial AI R&D speedup, and (b) that trend extrapolation from benchmark data would not be informative for predicting these paradigm shifts, then I would agree that the benchmarks + gaps method is not particularly informative.
Do you think that’s a fair summary of (this particular set of) necessary conditions?
(edit: didn’t see @Daniel Kokotajlo’s new comment before mine. I agree with him regarding disagreeing with both sub-claims but I think I have a sense of where you’re coming from.)
I don’t think this distinction between old-paradigm/old-concepts and new-paradigm/new-concepts is going to hold up very well to philosophical inspection or continued ML progress; it smells similar to ye olde “do LLMs truly understand, or are they merely stochastic parrots?” and “Can they extrapolate, or do they merely interpolate?”
I find this kind of pattern-match pretty unconvincing without more object-level explanation. Why exactly do you think this distinction isn’t important? (I’m also not sure “Can they extrapolate, or do they merely interpolate?” qualifies as “ye olde,” still seems like a good question to me at least w.r.t. sufficiently out-of-distribution extrapolation.)
We are at an impasse then; I think basically I’m just the mirror of you. To me, the burden is on whoever thinks the distinction is important to explain why it matters. Current LLMs do many amazing things that many people—including AI experts—thought LLMs could never do due to architectural limitations. Recent history is full of examples of AI experts saying “LLMs are the offramp to AGI; they cannot do X; we need new paradigm to do X” and then a year or two later LLMs are doing X. So now I’m skeptical and would ask questions like: “Can you say more about this distinction—is it a binary, or a dimension? If it’s a dimension, how can we measure progress along it, and are we sure there hasn’t been significant progress on it already in the last few years, within the current paradigm? If there has indeed been no significant progress (as with ARC-AGI until 2024) is there another explanation for why that might be, besides your favored one (that your distinction is super important and that because of it a new paradigm is needed to get to AGI)”
And I think you’re admitting that your argument is “if we mush all capabilities together into one dimension, AI is moving up on that one dimension, so things will keep going up”.
Would you say the same thing about the invention of search engines? That was a huge jump in the capability of our computers. And it looks even more impressive if you blur out your vision—pretend you don’t know that the text that comes up on your screen is written by a human, and pretend you don’t know that search is a specific kind of task distinct from a lot of other activity that would be involved in “True Understanding, woooo”—and just say “wow! previously our computers couldn’t write a poem, but now with just a few keystrokes my computer can literally produce Billy Collins level poetry!”.
Blurring things together at that level works for, like, macroeconomic trends. But if you look at macroeconomic trends it doesn’t say singularity in 2 years! Going to 2 or 10 years is an inside-view thing to conclude! You’re making some inference like “there’s an engine that is very likely operating here, that takes us to AGI in xyz years”.
I’m not saying that. You are the one who introduced the concept of “the core algorithms for intelligence;” you should explain what that means and why it’s a binary (or, if it’s not a binary but rather a dimension, why we haven’t been moving along that dimension in the recent past).
ETA: I do have an ontology, a way of thinking about these things, that is more sophisticated than simply mushing all capabilities together into one dimension. I just don’t accept your ontology yet.
(I might misunderstand you. My impression was that you’re saying it’s valid to extrapolate from “model XYZ does well at RE-Bench” to “model XYZ does well at developing new paradigms and concepts.” But maybe you’re saying that the trend of LLM success at various things suggests we don’t need new paradigms and concepts to get AGI in the first place? My reply below assumes the former:)
I’m not saying LLMs can’t develop new paradigms and concepts, though. The original claim you were responding to was that success at RE-Bench in particular doesn’t tell us much about success at developing new paradigms and concepts. “LLMs have done various things some people didn’t expect them to be able to do” doesn’t strike me as much of an argument against that.
More broadly, re: your burden of proof claim, I don’t buy that “LLMs have done various things some people didn’t expect them to be able to do” determinately pins down an extrapolation to “the current paradigm(s) will suffice for AGI, within 2-3 years.” That’s not a privileged reference class forecast, it’s a fairly specific prediction.
I feel like this sub-thread is going in circles; perhaps we should go back to the start of it. I said:
I don’t think this distinction between old-paradigm/old-concepts and new-paradigm/new-concepts is going to hold up very well to philosophical inspection or continued ML progress; it smells similar to ye olde “do LLMs truly understand, or are they merely stochastic parrots?” and “Can they extrapolate, or do they merely interpolate?”
You replied:
I find this kind of pattern-match pretty unconvincing without more object-level explanation. Why exactly do you think this distinction isn’t important? (I’m also not sure “Can they extrapolate, or do they merely interpolate?” qualifies as “ye olde,” still seems like a good question to me at least w.r.t. sufficiently out-of-distribution extrapolation.)
Now, elsewhere in this comment section, various people (Carl, Radford) have jumped in to say the sorts of object-level things I also would have said if I were going to get into it. E.g. that old vs. new paradigm isn’t a binary but a spectrum, that automating research engineering WOULD actually speed up new-paradigm discovery, etc. What do you think of the points they made?
Also, I’m still waiting to hear answers to these questions: “Can you say more about this distinction—is it a binary, or a dimension? If it’s a dimension, how can we measure progress along it, and are we sure there hasn’t been significant progress on it already in the last few years, within the current paradigm? If there has indeed been no significant progress (as with ARC-AGI until 2024) is there another explanation for why that might be, besides your favored one (that your distinction is super important and that because of it a new paradigm is needed to get to AGI)”
First, we know that labs are hill-climbing on benchmarks.
Obviously, this tends to inflate model performance on the specific benchmark tasks used for hill-climbing, relative to “similar” but non-benchmarked tasks.
More generally and insidiously, it tends to inflate performance on “the sort of things that are easy to measure with benchmarks,” relative to all other qualities that might be required to accelerate or replace various kinds of human labor.
If we suppose that amenability-to-benchmarking correlates with various other aspects of a given skill (which seems reasonable enough, “everything is correlated” after all), then we might expect that hill-climbing on a bunch of “easy to benchmark” tasks will induce generalization to other “easy to benchmark” tasks (even those that weren’t used for hill-climbing), without necessarily generalizing to tasks which are more difficult to measure.
For instance, perhaps hill-climbing on a variety of “difficult academic exam” tasks like GPQA will produce models that are very good at exam-like tasks in general, but which lag behind on various other skills which we would expect a human expert to possess if that human had similar exam scores to the model.
Anything that we can currently measure in a standardized, quantified way becomes a potential target for hill-climbing. These are the “benchmarks,” in the terms of your argument.
And anything we currently can’t (or simply don’t) measure well ends up as a “gap.” By definition, we don’t yet have clear quantitative visibility into how well we’re doing on the gaps, or how quickly we’re moving across them: if we did, then they would be “benchmarks” (and hill-climbing targets) rather than gaps.
It’s tempting here to try to forecast progress on the “gaps” by using recent progress on the “benchmarks” as a reference class. But this yields a biased estimate; we should expect average progress on “gaps” to be much slower than average progress on “benchmarks.”
The difference comes from the two factors I mentioned at the start:
Hill-climbing on a benchmark tends to improve that benchmark more than other things (including other, non-hill-climbed measures of the same underlying trait)
Benchmarks are – by definition – the things that are easy to measure, and thus to hill-climb.
Progress on such things is currently very fast, and presumably some of that speed owes to the rapid, quantitative, and inter-comparable feedback that benchmarks provide.
It’s not clear how much this kind of methodology generalizes to things that are important but inherently harder to measure. (How do you improve something if you can’t tell how good it is in the first place?)
Presumably things that are inherently harder to measure will improve more slowly – it’s harder to go fast when you’re “stumbling around in the dark” – and it’s difficult to know how big this effect is in advance.
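Here’s a toy simulation of that bias (the model and every parameter are invented purely for illustration): if hill-climbing gains accrue mostly to skills that are easy to measure, then the progress you observe on “benchmarks” systematically overstates progress on the “gaps”:

```python
import numpy as np

rng = np.random.default_rng(0)
n_skills = 10_000

# Assumed toy model: each skill has a "measurability" in [0, 1].
measurability = rng.uniform(0, 1, n_skills)

# Annual improvement = small baseline + a hill-climbing bonus that only accrues
# to easily-measured skills, plus noise. These numbers are pure assumptions.
progress = 0.05 + 0.50 * measurability + rng.normal(0, 0.05, n_skills)

benchmarks = measurability > 0.8   # the easily-measured skills we actually track
gaps = measurability < 0.2         # the "dark matter" we can't measure well

print(f"Mean observed progress on benchmarks: {progress[benchmarks].mean():.2f}")
print(f"Mean (unobserved) progress on gaps:   {progress[gaps].mean():.2f}")
# Extrapolating from the benchmark number to the gaps overstates gap progress several-fold.
```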
I don’t get a sense that AI labs are taking this kind of thing very seriously at the moment (at least in their public communications, anyway). The general vibe I get is like, “we love working on improvements to measurable things, and everything we can measure gets better with scale, so presumably all the things we can’t measure will get solved by scale too; in the meantime we’ll work on hill-climbing the hills that are on our radar.”
If the unmeasured stuff were simply a random sample from the same distribution as the measured stuff, this approach would make sense, but we have no reason to believe this is the case. Is all this scaling and benchmark-chasing really lifting all boats, simultaneously? I mean, how would we know, right? By definition, we can’t measure what we can’t measure.
Or, more accurately, we can’t measure it in quantitative and observer-independent fashion. That doesn’t mean we don’t know it exists.
Indeed, some of this “dark matter” may well be utterly obvious when one is using the models in practice. It’s there, and as humans we can see it perfectly well, even if we would find it difficult to think up a good benchmark for it.
As LLMs get smarter – and as the claimed distance between them and “human experts” diminishes – I find that these “obvious yet difficult-to-quantify gaps” increasingly dominate my experience of LLMs as a user.
Current frontier models are, in some sense, “much better than me at coding.” In a formal coding competition I would obviously lose to these things; I might well perform worse at more “real-world” stuff like SWE-Bench Verified, too.
Among humans with similar scores on coding and math benchmarks, many (if not all) of them would be better at my job than I am, and fully capable of replacing me as an employee. Yet the models are not capable of this.
Claude-3.7-Sonnet really does have remarkable programming skills (even by human standards), but it can’t adequately do my job – not even for a single day, or (I would expect) for a single hour. I can use it effectively to automate certain aspects of my work, but it needs constant handholding, and that’s when it’s on the fairly narrow rails of something like Cursor rather than in the messy, open-ended “agentic environment” that is the real workplace.
What is it missing? I don’t know, it’s hard to state precisely. (If it were easier to state precisely, it would be a “benchmark” rather than a “gap” and we’d be having a very different conversation right now.)
Something like, I dunno… “taste”? “Agency”?
“Being able to look at a messy real-world situation and determine what’s important and what’s not, rather than treating everything like some sort of school exam?”
“Talking through the problem like a coworker, rather than barreling forward with your best guess about what the nonexistent teacher will give you good marks for doing?”
“Acting like a curious experimenter, not a helpful-and-harmless pseudo-expert who already knows the right answer?”
“(Or, for that matter, acting like an RL ‘reasoning’ system awkwardly bolted on to an existing HHH chatbot, with a verbose CoT side-stream that endlessly speculates about ‘what the user might have really meant’ every time I say something unclear rather than just fucking asking me like any normal person would?)”
If you use LLMs to do serious work, these kinds of bottlenecks become apparent very fast.
Scaling up training on “difficult academic exam”-type tasks is not going to remove the things that prevent the LLM from doing my job. I don’t know what those things are, exactly, but I do know that the problem is not “insufficient skill at impressive-looking ‘expert’ benchmark tasks.” Why? Because the model is already way better than me at difficult academic tests, and yet – it still can’t autonomously do my job, or yours, or (to a first approximation) anyone else’s.
On GPQA — a benchmark of Ph.D-level science questions — GPT-4 performed marginally better than random guessing. 18 months later, the best reasoning models outperform PhD-level experts.
Well, that certainly sounds impressive. Certainly something happened here. But what, exactly?
If you showed this line to someone who knew nothing about the context, I imagine they would (A) vastly overestimate the usefulness of current models as academic research assistants, and (B) vastly underestimate the usefulness of GPT-4 in the same role.
GPT-4 already knew all kinds of science facts of the sort that GPQA tests, even if it didn’t know them quite as well, or wasn’t as readily able to integrate them in the exact way that GPQA expects (that’s hill-climbing for you).
What was lacking was not mainly the knowledge itself – GPT-4 was already incredibly good at obscure book-learning! – but all the… other stuff involved in competent research assistance. The dark matter, the soft skills, the unmeasurables, the gaps. The kind of thing I was talking about just a moment ago. “Taste,” or “agency,” or “acting like you have real-world experience rather than just being a child prodigy who’s really good at exams.”
And the newer models don’t have that stuff either. They can “do” more things if you give them constant handholding, but they still need that hand-holding; they still can’t apply common sense to reason their way through situations that don’t resemble a school exam or an interaction with a gormless ChatGPT user in search of a clean, decontextualized helpful-and-harmless “answer.” If they were people, I would not want to hire them, any more than I’d want to hire GPT-4.
If (as I claim) all this “dark matter” is not improving much, then we are not going to get a self-improvement loop unless
It turns out that models without these abilities can bootstrap their way into having them
Labs start taking the “dark matter” much more seriously than they have so far, rather than just hill-climbing easily measurable things and leaning on scaling and RSI for everything else
I doubt that (1) will hold: the qualities that are missing are closely related to things like “ability to act without supervision” and “research/design/engineering taste” that seem very important for self-improvement.
As for (2), well, my best guess is that we’ll have to wait until ~2027-2028, at which point it will become clear that the “just scale and hill-climb and increasingly defer to your HHH assistant” approach somehow didn’t work – and then, at last, we’ll start seeing serious attempts to succeed at the unmeasurable.
I think the labs might well be rational in focusing on this sort of “handheld automation”, just to enable their researchers to code experiments faster and in smaller teams.
My mental model of AI R&D is that it can be bottlenecked roughly by three things: compute, engineering time, and the “dark matter” of taste and feedback loops on messy research results. I can certainly imagine a model of lab productivity where the best way to accelerate is improving handheld automation for the entirety of 2025. Say, the core paradigm is fixed; but inside that paradigm, the research team has more promising ideas than they have time to implement and try out on smaller-scale experiments; and they really do not want to hire more people.
If you consider the AI lab as a fundamental unit that wants to increase its velocity, and works on things that make models faster, it’s plausible they can be aware of how bad the model performance is on research taste, and still not be making a mistake by ignoring your “dark matter” right now. They will work on it when they are faster.
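A toy serial-pipeline version of that mental model (the stage durations are invented): while engineering dominates an iteration, automating it buys a lot, and only once it’s nearly free do compute and taste become the binding constraints, which is roughly when the “dark matter” starts to matter.

```python
# Toy model of one research iteration, split into three serial stages (days).
# Durations are invented assumptions, chosen so engineering time dominates today.
stages = {"engineering": 20.0, "compute_wait": 8.0, "taste_and_analysis": 2.0}

def iteration_speedup(engineering_speedup):
    before = sum(stages.values())
    after = (stages["engineering"] / engineering_speedup
             + stages["compute_wait"] + stages["taste_and_analysis"])
    return before / after

for s in (2, 5, 100):
    print(f"{s:>3}x faster engineering -> {iteration_speedup(s):.2f}x faster iterations")
# Once engineering is nearly free, the remaining stages cap the gain at ~3x in this toy setup.
```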
@elifland what do you think is the strongest argument for long(er) timelines? Do you think it’s essentially just “it takes a long time for researchers to learn how to cross the gaps”?
Or do you think there’s an entirely different frame (something that’s in an ontology that just looks very different from the one presented in the “benchmarks + gaps argument”?)
A few possible categories of situations we might have long timelines, off the top of my head:
Benchmarks + gaps is still best: the overall gap is somewhat larger + a slowdown in compute doubling time after 2028, but trend extrapolations still tell us something about gap trends. This is how I would most naturally think about how timelines through maybe the 2030s are achieved, and potentially beyond if neither of the next two holds.
Others are best (more than one of these can be true):
The current benchmarks and evaluations are so far away from AGI that trends on them don’t tell us anything (including regarding how fast gaps might be crossed). In this case one might want to identify the 1-2 most important gaps and reason about when we will cross these based on gears-level reasoning or trend extrapolation/forecasting on “real-world” data (e.g. revenue?) rather than trend extrapolation on benchmarks. Example candidate “gaps” that I often hear for these sorts of cases are the lack of feedback loops and the “long-tail of tasks” / reliability.
A paradigm shift in AGI training is needed and benchmark trends don’t tell us much about when we will achieve this (this is basically Steven’s sibling comment): in this case the best analysis might involve looking at the base rate of paradigm shifts per research effort, and/or looking at specific possible shifts.
^ this taxonomy is not comprehensive, just things I came up with quickly. I might be missing something that would be good to include.
To give a cop-out answer to your question: I feel like if I were making a long-timelines argument I’d argue that all 3 of those would be ways of forecasting to give weight to, then aggregate. If I had to choose just one I’d probably still go with (1) though.
edit: oh there’s also the “defer to AI experts” argument. I mostly try not to think about deference-based arguments because thinking on the object-level is more productive, though I think if I were really trying to make an all-things-considered timelines distribution there’s some chance I would adjust to longer due to deference arguments (but also some chance I’d adjust toward shorter, given that lots of people who have thought deeply about AGI / are close to the action have short timelines).
There’s also “base rate of super crazy things happening is low” style arguments which I don’t give much weight to.
Thanks. I think this argument assumes that the main bottleneck to AI progress is something like research engineering speed, such that accelerating research engineering speed would drastically increase AI progress?
I think that makes sense as long as we are talking about domains like games / math / programming where you can automatically verify the results, but that something like the speed of real-world interaction becomes the bottleneck once we shift to more open domains.
Consider an AI being trained on a task such as “acting as the CEO for a startup”. There may not be a way to do this training other than to have it actually run a real startup, and then wait for several years to see how the results turn out. Even after several years, it will be hard to say exactly which parts of the decision process contributed, and how much of the startup’s success or failure was due to random factors. Furthermore, during this process the AI will need to be closely monitored in order to make sure that it does not do anything illegal or grossly immoral, slowing down its decision process and thus the whole training. And I haven’t even mentioned the expense of a training run where running just a single trial requires a startup-level investment (assuming that the startup won’t pay back its investment, of course).
Of course, humans do not learn to be CEOs by running a million companies and then getting a reward signal at the end. Human CEOs come in with a number of skills that they have already learned from somewhere else that they then apply to the context of running a company, shifting between their existing skills and applying them as needed. However, the question of what kind of approach and skill to apply in what situation, and how to prioritize between different approaches, is by itself a skillset that needs to be learned… quite possibly through a lot of real-world feedback.
Here’s the structure of the argument that I am most compelled by (I call it the benchmarks + gaps argument), I’m uncertain about the details.
Focus on the endpoint of substantially speeding up AI R&D / automating research engineering. Let’s define our timelines endpoint as something that ~5xs the rate of AI R&D algorithmic progress (compared to a counterfactual world with no post-2024 AIs). Then make an argument that ~fully automating research engineering (experiment implementation/monitoring) would do this, along with research taste of at least the 50th percentile AGI company researcher (experiment ideation/selection).
Focus on REBench since it’s the most relevant benchmark. REBench is the most relevant benchmark here, for simplicity I’ll focus on only this though for robustness more benchmarks should be considered.
Based on trend extrapolation and benchmark base rates, roughly 50% we’ll saturate REBench by end of 2025.
Identify the most important gaps between saturating REBench and the endpoint defined in (1). The most important gaps between saturating REBench and achieving the 5xing AI R&D algorithmic progress are: (a) time horizon as measured by human time spent (b) tasks with worse feedback loops (c) tasks with large codebases (d) becoming significantly cheaper and/or faster than humans. There are some more but my best guess is that these 4 are the most important, should also take into account unknown gaps.
When forecasting the time to cross the gaps, it seems quite plausible that we get to the substantial AI R&D speedup within a few years after saturating REBench, so by end of 2028 (and significantly earlier doesn’t seem crazy).
This is the most important part of the argument, and one that I have lots of uncertainty over. We have some data regarding the “crossing speed” of some of the gaps but the data are quite limited at the moment. So there are a lot of judgment calls needed and people with strong long timelines intuitions might think the remaining gaps will take a long time to cross without this being close to falsified by our data.
This is broken down into “time to cross the gaps at 2024 pace of progress” → adjusting based on compute forecasts and intermediate AI R&D speedups before reaching 5x.
From substantial AI R&D speedup to AGI. Once we have the 5xing AIs, that’s potentially already AGI by some definitions but if you have a stronger one, the possibility of a somewhat fast takeoff means you might get it within a year or so after.
One reason I like this argument is that it will get much stronger over time as we get more difficult benchmarks and otherwise get more data about how quickly the gaps are being crossed.
I have a longer draft which makes this argument but it’s quite messy and incomplete and might not add much on top of the above summary for now. Unfortunately I’m prioritizing other workstreams over finishing this at the moment. DM me if you’d really like a link to the messy draft.
RE-bench tasks (see page 7 here) are not the kind of AI research where you’re developing new AI paradigms and concepts. The tasks are much more straightforward than that. So your argument is basically assuming without argument that we can get to AGI with just the more straightforward stuff, as opposed to new AI paradigms and concepts.
If we do need new AI paradigms and concepts to get to AGI, then there would be a chicken-and-egg problem in automating AI research. Or more specifically, there would be two categories of AI R&D, with the less important R&D category (e.g. performance optimization and other REbench-type tasks) being automatable by near-future AIs, and the more important R&D category (developing new AI paradigms and concepts) not being automatable.
(Obviously you’re entitled to argue / believe that we don’t need need new AI paradigms and concepts to get to AGI! It’s a topic where I think reasonable people disagree. I’m just suggesting that it’s a necessary assumption for your argument to hang together, right?)
I disagree. I think the existing body of published computer science and neuroscience research are chock full of loose threads. Tons of potential innovations just waiting to be harvested by automated researchers. I’ve mentioned this idea elsewhere. I call it an ‘innovation overhang’. Simply testing interpolations and extrapolations (e.g. scaling up old forgotten ideas on modern hardware) seems highly likely to reveal plenty of successful new concepts, even if the hit rate per attempt is low. I think this means a better benchmark would consist of: taking two existing papers, finding a plausible hypothesis which combines the assumptions from the papers, designs and codes and runs tests, then reports on results.
So I don’t think “no new concepts” is a necessary assumption for getting to AGI quickly with the help of automated researchers.
Is this bottlenecked by programmer time or by compute cost?
Both? If you increase only one of the two the other becomes the bottleneck?
I agree this means that the decision to devote substantial compute to both inference and to assigning compute resources for running experiments designed by AI reseachers is a large cost. Presumably, as the competence of the AI reseachers gets higher, it feels easier to trust them not to waste their assigned experiment compute.
There was discussion on Dwarkesh Patel’s interview with researcher friends where there was mention that AI reseachers are already restricted by compute granted to them for experiments. Probably also on work hours per week they are allowed to spend on novel “off the main path” research.
So in order for there to be a big surge in AI R&D there’d need to be prioritization of that at a high level. This would be a change of direction from focusing primarily on scaling current techniques rapidly, and putting out slightly better products ASAP.
So yes, if you think that this priority shift won’t happen, then you should doubt that the increase in R&D speed my model predicts will occur.
But what would that world look like? Probably a world where scaling continues to pay dividends, and getting to AGI is more straightforward yhan Steve Byrnes or I expect.
I agree that that’s a substantial probability, but it’s also an AGI-soon sort of world.
I argue that for AGI to be not-soon, you need both scaling to fail and for algorithm research to fail.
My impression based on talking to people at labs plus stuff I’ve read is that
Most AI researchers have no trouble coming up with useful ways of spending all of the compute available to them
Most of the expense of hiring AI reseachers is compute costs for their experiments rather than salary
The big scaling labs try their best to hire the very best people they can get their hands on and concentrate their resources heavily into just a few teams, rather than trying to hire everyone with a pulse who can rub two tensors together.
(Very open to correction by people closer to the big scaling labs).
My model, then, says that compute availability is a constraint that binds much harder than programming or research ability, at least as things stand right now.
Sounds plausible to me. Especially since benchmarks encourage a focus on ability to hit the target at all rather than ability to either succeed or fail cheaply, which is what’s important in domains where the salary / electric bill of the experiment designer is an insignificant fraction of the total cost of the experiment.
Yeah, I expect it’s a matter of “dumb” scaling plus experimentation rather than any major new insights being needed. If scaling hits a wall that training on generated data + fine tuning + routing + specialization can’t overcome, I do agree that innovation becomes more important than iteration.
My model is not just “AGI-soon” but “the more permissive thresholds for when something should be considered AGI have already been met, and more such thresholds will fall in short order, and so we should stop asking when we will get AGI and start asking about when we will see each of the phenomena that we are using AGI as a proxy for”.
I think you’re mostly correct about current AI reseachers being able to usefully experiment with all the compute they have available.
I do think there are some considerations here though.
How closely are they adhering to the “main path” of scaling existing techniques with minor tweaks? If you want to know how a minor tweak affects your current large model at scale, that is a very compute-heavy researcher-time-light type of experiment. On the other hand, if you want to test a lot of novel new paths at much smaller scales, then you are in a relatively compute-light but researcher-time-heavy regime.
What fraction of the available compute resources is the company assigning to each of training/inference/experiments? My guess it that the current split is somewhere around 63/33/4. If this was true, and the company decided to pivot away from training to focus on experiments (0/33/67), this would be something like a 16x increase in compute for experiments. So maybe that changes the bottleneck?
We do indeed seem to be at “AGI for most stuff”, but with a spikey envelope of capability that leaves some dramatic failure modes. So it does make more sense to ask something like, “For remaining specific weakness X, what will the research agenda and timeline look like?”
This makes more sense then continuing to ask the vague “AGI complete” question when we are most of the way there already.
For context in a sibling comment Ryan said and Steven agreed with:
Now responding on whether I think the no new paradigms assumption is needed:
I generally have not been thinking in these sorts of binary terms but instead thinking in terms more like “Algorithmic progress research is moving at pace X today, if we had automated research engineers it would be sped up to N*X.” I’m not necessarily taking a stand on whether the progress will involve new paradigms or not, so I don’t think it requires an assumption of no new paradigms.
However:
If you think almost all new progress in some important sense will come from paradigm shifts, the forecasting method becomes weaker because the incremental progress doesn’t say as much about progress toward automated research engineering or AGI.
You might think that it’s more confusing than clarifying to think in terms of collapsing all research progress into a single “speed” and forecasting based on that.
Requiring a paradigm shift might lead to placing less weight on lower amounts of research effort required, and even if the probability distribution is the same what we should expect to see in the world leading up to AGI is not.
I’d also add that:
Regarding what research tasks I’m forecasting for the automated research engineer: REBench is not supposed to fully represent the tasks involved in actual research engineering. That’s why we have the gaps.
Regarding to what extent having an automated research engineer would speed up progress in worlds in which we need a paradigm shift: I think it’s hard to separate out conceptual from engineering/empirical work in terms of progress toward new paradigms. My guess would be being able to implement experiments very cheaply would substantially increase the expected number of paradigm shifts per unit time.
It sounds like your disagreement isn’t with drawing a link from RE-bench to (forecasts for) automating research engineering, but is instead with thinking that you can get AGI shortly after automating research engineering due to AI R&D acceleration and already being pretty close. Is that right?
Note that the comment says research engineering, not research scientists.
Thanks for this thoughtful reply!
In the framework of the argument, you seem to be objecting to premises 4-6. Specifically you seem to be saying “There’s another important gap between RE-bench saturation and completely automating AI R&D: new-paradigm-and-concept-generation. Perhaps we can speed up AI R&D by 5x or so without crossing this gap, simply by automating engineering, but to get to AGI we’ll need to cross this gap, and this gap might take a long time to cross even at 5x speed.”
(Is this a fair summary?)
If that’s what you are saying, I think I’d reply:
We already have a list of potential gaps, and this one seems to be a mediocre addition to the list IMO. I don’t think this distinction between old-paradigm/old-concepts and new-paradigm/new-concepts is going to hold up very well to philosophical inspection or continued ML progress; it smells similar to ye olde “do LLMs truly understand, or are they merely stochastic parrots?” and “Can they extrapolate, or do they merely interpolate?”
That said, I do think it’s worthy of being included on the list. I’m just not as excited about it as the other entries, especially (a) and (b).
I’d also say: What makes you think that this gap will take years to cross even at 5x speed? (i.e. even when algorithmic progress is 5x faster than it has been for the past decade) Do you have a positive argument, or is it just generic uncertainty / absence-of-evidence?
(For context: I work in the same org as Eli and basically agree with his argument above)
I think I’m objecting to (as Eli wrote) “collapsing all [AI] research progress into a single “speed” and forecasting based on that”. There can be different types of AI R&D, and we might be able to speed up some types without speeding up other types. For example, coming up with the AlphaGo paradigm (self-play, MCTS, ConvNets, etc.) or LLM paradigm (self-supervised pretraining, Transformers, etc.) is more foundational, whereas efficiently implementing and debugging a plan is less foundational. (Kinda “science vs engineering”?) I also sometimes use the example of Judea Pearl coming up with the belief prop algorithm in 1982. If everyone had tons of compute and automated research engineer assistants, would we have gotten belief prop earlier? I’m skeptical. As far as I understand: Belief prop was not waiting on compute. You can do belief prop on a 1960s mainframe. Heck, you can do belief prop on an abacus. Social scientists have been collecting data since the 1800s, and I imagine that belief prop would have been useful for analyzing at least some of that data, if only someone had invented it.
Indeed. Not only could belief prop have been invented in 1960, it was invented around 1960 (published 1962, “Low density parity check codes”, IRE Transactions on Information Theory) by Robert Gallager, as a decoding algorithm for error correcting codes.
I recognized that Gallager’s method was the same as Pearl’s belief propagation in 1996 (MacKay and Neal, “Near Shannon limit performance of low density parity check codes”, Electronics Letters, vol. 33, pp. 457-458).
This says something about the ability of AI to potentially speed up research by simply linking known ideas (even if it’s not really AGI).
Came here to say this, got beaten to it by Radford Neal himself, wow! Well, I’m gonna comment anyway, even though it’s mostly been said.
Gallager proposed belief propagation as an approximate good-enough method of decoding a certain error-correcting code, but didn’t notice that it worked on all sorts of probability problems. Pearl proposed it as a general mechanism for dealing with probability problems, but wanted perfect mathematical correctness, so confined himself to tree-shaped problems. It was their common generalization that was the real breakthrough: an approximate good-enough solution to all sorts of problems. Which is what Pearl eventually noticed, so props to him.
If we’d had AGI in the 1960s, someone with a probability problem could have said “Here’s my problem. For every paper in the literature, spawn an instance to read that paper and tell me if it has any help for my problem.” It would have found Gallager’s paper and said “Maybe you could use this?”
I just wanted to add that this hypothesis, i.e.
…is parallel to what we see in other kinds of automation.
The technology of today has been much better at automating the production of clocks than the production of haircuts. Thus, 2024 technology is great at automating the production of some physical things but only slightly helpful for automating the production of some other physical things.
By the same token, different AI R&D projects are trying to “produce” different types of IP. Thus, it’s similarly possible that 2029 AI technology will be great at automating the production of some types of AI-related IP but only slightly helpful for automating the production of some other types of AI-related IP.
I disagree that there is a difference of kind between “engineering ingenuity” and “scientific discovery”, at least in the business of AI. The examples you give—self-play, MCTS, ConvNets—were all used in game-playing programs before AlphaGo. The trick of AlphaGo was to combine them, and then discover that it worked astonishingly well. It was very clever and tasteful engineering to combine them, but only a breakthrough in retrospect. And the people who developed each of them earlier, for their own independent purposes? They were part of the ordinary cycle of engineering development: “Look at a problem, think as hard as you can, come up with something, try it, publish the results.” They’re just the ones you remember, because they were good.
Paradigm shifts do happen, but I don’t think we need them between here and AGI.
Yeah I’m definitely describing something as a binary when it’s really a spectrum. (I was oversimplifying since I didn’t think it mattered for that particular context.)
In the context of AI, I don’t know what the difference is (if any) between engineering and science. You’re right that I was off-base there…
…But I do think that there’s a spectrum from ingenuity / insight to grunt-work.
So I’m bringing up a possible scenario where near-future AI gets progressively less useful as you move towards the ingenuity side of that spectrum, and where changing that situation (i.e., automating ingenuity) itself requires a lot of ingenuity, posing a chicken-and-egg problem / bottleneck that limits the scope of rapid near-future recursive AI progress.
Perhaps! Time will tell :)
I certainly agree that the collapse is a lossy abstraction / simplifies; in reality some domains of research will speed up more than 5x and others less than 5x, for example, even if we did get automated research engineers dropped on our heads tomorrow. Are you therefore arguing that in particular, the research needed to get to AGI is of the kind that won’t be sped up significantly? What’s the argument—that we need a new paradigm to get to AIs that can generate new paradigms, and being able to code really fast and well won’t majorly help us think of new paradigms? (I’d disagree with both sub-claims of that claim)
Yup! Although I’d say I’m “bringing up a possibility” rather than “arguing” in this particular thread. And I guess it depends on where we draw the line between “majorly” and “minorly” :)
This is clarifying for me, appreciate it. If I believed (a) that we needed a paradigm shift like the one that led to LLMs in order to get AI systems that produce a substantial AI R&D speedup, and (b) that trend extrapolation from benchmark data would not be informative for predicting these paradigm shifts, then I would agree that the benchmarks + gaps method is not particularly informative.
Do you think that’s a fair summary of (this particular set of) necessary conditions?
(edit: didn’t see @Daniel Kokotajlo’s new comment before mine. I agree with him regarding disagreeing with both sub-claims but I think I have a sense of where you’re coming from.)
I find this kind of pattern-match pretty unconvincing without more object-level explanation. Why exactly do you think this distinction isn’t important? (I’m also not sure “Can they extrapolate, or do they merely interpolate?” qualifies as “ye olde”; it still seems like a good question to me, at least w.r.t. sufficiently out-of-distribution extrapolation.)
We are at an impasse then; I think basically I’m just the mirror of you. To me, the burden is on whoever thinks the distinction is important to explain why it matters. Current LLMs do many amazing things that many people—including AI experts—thought LLMs could never do due to architectural limitations. Recent history is full of examples of AI experts saying “LLMs are the offramp to AGI; they cannot do X; we need a new paradigm to do X” and then a year or two later LLMs are doing X. So now I’m skeptical and would ask questions like: “Can you say more about this distinction—is it a binary, or a dimension? If it’s a dimension, how can we measure progress along it, and are we sure there hasn’t been significant progress on it already in the last few years, within the current paradigm? If there has indeed been no significant progress (as with ARC-AGI until 2024), is there another explanation for why that might be, besides your favored one (that your distinction is super important and that because of it a new paradigm is needed to get to AGI)?”
The burden is on you because you’re saying “we have gone from not having the core algorithms for intelligence in our computers, to yes having them”.
https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce#The__no_blockers__intuition
And I think you’re admitting that your argument is “if we mush all capabilities together into one dimension, AI is moving up on that one dimension, so things will keep going up”.
Would you say the same thing about the invention of search engines? That was a huge jump in the capability of our computers. And it looks even more impressive if you blur out your vision—pretend you don’t know that the text that comes up on your screen is written by a human, and pretend you don’t know that search is a specific kind of task distinct from a lot of other activity that would be involved in “True Understanding, woooo”—and just say “wow! previously our computers couldn’t write a poem, but now with just a few keystrokes my computer can literally produce Billy Collins-level poetry!”.
Blurring things together at that level works for, like, macroeconomic trends. But if you look at macroeconomic trends it doesn’t say singularity in 2 years! Going to 2 or 10 years is an inside-view thing to conclude! You’re making some inference like “there’s an engine that is very likely operating here, that takes us to AGI in xyz years”.
I’m not saying that. You are the one who introduced the concept of “the core algorithms for intelligence;” you should explain what that means and why it’s a binary (or, if it’s not a binary but rather a dimension, why we haven’t been moving along that dimension in the recent past).
ETA: I do have an ontology, a way of thinking about these things, that is more sophisticated than simply mushing all capabilities together into one dimension. I just don’t accept your ontology yet.
(I might misunderstand you. My impression was that you’re saying it’s valid to extrapolate from “model XYZ does well at RE-Bench” to “model XYZ does well at developing new paradigms and concepts.” But maybe you’re saying that the trend of LLM success at various things suggests we don’t need new paradigms and concepts to get AGI in the first place? My reply below assumes the former:)
I’m not saying LLMs can’t develop new paradigms and concepts, though. The original claim you were responding to was that success at RE-Bench in particular doesn’t tell us much about success at developing new paradigms and concepts. “LLMs have done various things some people didn’t expect them to be able to do” doesn’t strike me as much of an argument against that.
More broadly, re: your burden of proof claim, I don’t buy that “LLMs have done various things some people didn’t expect them to be able to do” determinately pins down an extrapolation to “the current paradigm(s) will suffice for AGI, within 2-3 years.” That’s not a privileged reference class forecast, it’s a fairly specific prediction.
I feel like this sub-thread is going in circles; perhaps we should go back to the start of it. I said:
You replied:
Now, elsewhere in this comment section, various people (Carl, Radford) have jumped in to say the sorts of object-level things I also would have said if I were going to get into it. E.g. that old vs. new paradigm isn’t a binary but a spectrum, that automating research engineering WOULD actually speed up new-paradigm discovery, etc. What do you think of the points they made?
Also, I’m still waiting to hear answers to these questions: “Can you say more about this distinction—is it a binary, or a dimension? If it’s a dimension, how can we measure progress along it, and are we sure there hasn’t been significant progress on it already in the last few years, within the current paradigm? If there has indeed been no significant progress (as with ARC-AGI until 2024), is there another explanation for why that might be, besides your favored one (that your distinction is super important and that because of it a new paradigm is needed to get to AGI)?”
Here’s why I’m wary of this kind of argument:
First, we know that labs are hill-climbing on benchmarks.
Obviously, this tends to inflate model performance on the specific benchmark tasks used for hill-climbing, relative to “similar” but non-benchmarked tasks.
More generally and insidiously, it tends to inflate performance on “the sort of things that are easy to measure with benchmarks,” relative to all other qualities that might be required to accelerate or replace various kinds of human labor.
If we suppose that amenability-to-benchmarking correlates with various other aspects of a given skill (which seems reasonable enough, “everything is correlated” after all), then we might expect that hill-climbing on a bunch of “easy to benchmark” tasks will induce generalization to other “easy to benchmark” tasks (even those that weren’t used for hill-climbing), without necessarily generalizing to tasks which are more difficult to measure.
For instance, perhaps hill-climbing on a variety of “difficult academic exam” tasks like GPQA will produce models that are very good at exam-like tasks in general, but which lag behind on various other skills which we would expect a human expert to possess if that human had similar exam scores to the model.
Anything that we can currently measure in a standardized, quantified way becomes a potential target for hill-climbing. These are the “benchmarks,” in the terms of your argument.
And anything we currently can’t (or simply don’t) measure well ends up as a “gap.” By definition, we don’t yet have clear quantitative visibility into how well we’re doing on the gaps, or how quickly we’re moving across them: if we did, then they would be “benchmarks” (and hill-climbing targets) rather than gaps.
It’s tempting here to try to forecast progress on the “gaps” by using recent progress on the “benchmarks” as a reference class. But this yields a biased estimate; we should expect average progress on “gaps” to be much slower than average progress on “benchmarks.” (See the toy simulation below.)
The difference comes from the two factors I mentioned at the start:
Hill-climbing on a benchmark tends to improve that benchmark more than other things (including other, non-hill-climbed measures of the same underlying trait)
Benchmarks are – by definition – the things that are easy to measure, and thus to hill-climb.
Progress on such things is currently very fast, and presumably some of that speed owes to the rapid, quantitative, and inter-comparable feedback that benchmarks provide.
It’s not clear how much this kind of methodology generalizes to things that are important but inherently harder to measure. (How do you improve something if you can’t tell how good it is in the first place?)
Presumably things that are inherently harder to measure will improve more slowly – it’s harder to go fast when you’re “stumbling around in the dark” – and it’s difficult to know how big this effect is in advance.
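(To make the first of those two factors concrete, here is a toy simulation, with entirely made-up numbers, of the selection effect being described: a shared latent skill drives both a measurable benchmark and an unmeasured trait, and selecting hard on the noisy benchmark yields much more apparent progress on the benchmark than on the unmeasured trait, about twice as much in this particular toy setup.)

```python
# Toy illustration (made-up numbers) of selection on a noisy, measurable proxy.
# A shared latent "skill" drives both a benchmark score and an unmeasured trait;
# keeping only the variants that score highest on the benchmark inflates measured
# progress relative to progress on the unmeasured trait.
import numpy as np

rng = np.random.default_rng(0)
n_variants = 100_000

skill = rng.normal(size=n_variants)                      # shared latent ability
benchmark = skill + rng.normal(size=n_variants)          # what we can measure (noisy)
unmeasured = skill + rng.normal(size=n_variants)         # the "gap" trait (equally noisy)

# "Hill-climbing": keep the top 1% of variants by benchmark score.
selected = benchmark >= np.quantile(benchmark, 0.99)

print("mean benchmark score of selected variants: ", benchmark[selected].mean())
print("mean unmeasured score of selected variants:", unmeasured[selected].mean())
# In this toy setup the unmeasured gain is only about half the benchmark gain.
```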
I don’t get a sense that AI labs are taking this kind of thing very seriously at the moment (at least in their public communications). The general vibe I get is like, “we love working on improvements to measurable things, and everything we can measure gets better with scale, so presumably all the things we can’t measure will get solved by scale too; in the meantime we’ll work on hill-climbing the hills that are on our radar.”
If the unmeasured stuff were simply a random sample from the same distribution as the measured stuff, this approach would make sense, but we have no reason to believe this is the case. Is all this scaling and benchmark-chasing really lifting all boats, simultaneously? I mean, how would we know, right? By definition, we can’t measure what we can’t measure.
Or, more accurately, we can’t measure it in quantitative and observer-independent fashion. That doesn’t mean we don’t know it exists.
Indeed, some of this “dark matter” may well be utterly obvious when one is using the models in practice. It’s there, and as humans we can see it perfectly well, even if we would find it difficult to think up a good benchmark for it.
As LLMs get smarter – and as the claimed distance between them and “human experts” diminishes – I find that these “obvious yet difficult-to-quantify gaps” increasingly dominate my experience of LLMs as a user.
Current frontier models are, in some sense, “much better than me at coding.” In a formal coding competition I would obviously lose to these things; I might well perform worse at more “real-world” stuff like SWE-Bench Verified, too.
Among humans with similar scores on coding and math benchmarks, many (if not all) of them would be better at my job than I am, and fully capable of replacing me as an employee. Yet the models are not capable of this.
Claude-3.7-Sonnet really does have remarkable programming skills (even by human standards), but it can’t adequately do my job – not even for a single day, or (I would expect) for a single hour. I can use it effectively to automate certain aspects of my work, but it needs constant handholding, and that’s when it’s on the fairly narrow rails of something like Cursor rather than in the messy, open-ended “agentic environment” that is the real workplace.
What is it missing? I don’t know, it’s hard to state precisely. (If it were easier to state precisely, it would be a “benchmark” rather than a “gap” and we’d be having a very different conversation right now.)
Something like, I dunno… “taste”? “Agency”?
“Being able to look at a messy real-world situation and determine what’s important and what’s not, rather than treating everything like some sort of school exam?”
“Talking through the problem like a coworker, rather than barreling forward with your best guess about what the nonexistent teacher will give you good marks for doing?”
“Acting like a curious experimenter, not a helpful-and-harmless pseudo-expert who already knows the right answer?”
“(Or, for that matter, acting like an RL ‘reasoning’ system awkwardly bolted on to an existing HHH chatbot, with a verbose CoT side-stream that endlessly speculates about ‘what the user might have really meant’ every time I say something unclear rather than just fucking asking me like any normal person would?)”
If you use LLMs to do serious work, these kinds of bottlenecks become apparent very fast.
Scaling up training on “difficult academic exam”-type tasks is not going to remove the things that prevent the LLM from doing my job. I don’t know what those things are, exactly, but I do know that the problem is not “insufficient skill at impressive-looking ‘expert’ benchmark tasks.” Why? Because the model is already way better than me at difficult academic tests, and yet – it still can’t autonomously do my job, or yours, or (to a first approximation) anyone else’s.
Or, consider the ascent of GPQA scores. As “Preparing for the Intelligence Explosion” puts it:
Well, that certainly sounds impressive. Certainly something happened here. But what, exactly?
If you showed this line to someone who knew nothing about the context, I imagine they would (A) vastly overestimate the usefulness of current models as academic research assistants, and (B) vastly underestimate the usefulness of GPT-4 in the same role.
GPT-4 already knew all kinds of science facts of the sort that GPQA tests, even if it didn’t know them quite as well, or wasn’t as readily able to integrate them in the exact way that GPQA expects (that’s hill-climbing for you).
What was lacking was not mainly the knowledge itself – GPT-4 was already incredibly good at obscure book-learning! – but all the… other stuff involved in competent research assistance. The dark matter, the soft skills, the unmeasurables, the gaps. The kind of thing I was talking about just a moment ago. “Taste,” or “agency,” or “acting like you have real-world experience rather than just being a child prodigy who’s really good at exams.”
And the newer models don’t have that stuff either. They can “do” more things if you give them constant handholding, but they still need that handholding; they still can’t apply common sense to reason their way through situations that don’t resemble a school exam or an interaction with a gormless ChatGPT user in search of a clean, decontextualized helpful-and-harmless “answer.” If they were people, I would not want to hire them, any more than I’d want to hire GPT-4.
If (as I claim) all this “dark matter” is not improving much, then we are not going to get a self-improvement loop unless:
(1) It turns out that models without these abilities can bootstrap their way into having them
(2) Labs start taking the “dark matter” much more seriously than they have so far, rather than just hill-climbing easily measurable things and leaning on scaling and RSI for everything else
I doubt that (1) will hold: the qualities that are missing are closely related to things like “ability to act without supervision” and “research/design/engineering taste” that seem very important for self-improvement.
As for (2), well, my best guess is that we’ll have to wait until ~2027-2028, at which point it will become clear that the “just scale and hill-climb and increasingly defer to your HHH assistant” approach somehow didn’t work – and then, at last, we’ll start seeing serious attempts to succeed at the unmeasurable.
I think the labs might well be rational in focusing on this sort of “handheld automation”, just to enable their researchers to code experiments faster and in smaller teams.
My mental model of AI R&D is that it can be bottlenecked roughly by three things: compute, engineering time, and the “dark matter” of taste and feedback loops on messy research results. I can certainly imagine a model of lab productivity where the best way to accelerate is improving handheld automation for the entirety of 2025. Say, the core paradigm is fixed; but inside that paradigm, the research team has more promising ideas than they have time to implement and try out on smaller-scale experiments; and they really do not want to hire more people.
If you consider the AI lab as a fundamental unit that wants to increase its velocity and works on whatever increases it, it’s plausible the lab can be aware of how bad model performance is on research taste and still not be making a mistake by ignoring your “dark matter” right now. They will work on it once they are faster.
@elifland what do you think is the strongest argument for long(er) timelines? Do you think it’s essentially just “it takes a long time for researchers to learn how to cross the gaps”?
Or do you think there’s an entirely different frame (something in an ontology that just looks very different from the one presented in the “benchmarks + gaps” argument)?
A few possible categories of situations we might have long timelines, off the top of my head:
Benchmarks + gaps is still best: the overall gap is somewhat larger and compute doubling slows after 2028, but trend extrapolations still tell us something about gap trends. This is how I would most naturally think about timelines that run through maybe the 2030s, and potentially beyond if neither of the next two holds.
Others are best (more than one of these can be true):
The current benchmarks and evaluations are so far away from AGI that trends on them don’t tell us anything (including regarding how fast gaps might be crossed). In this case one might want to identify the 1-2 most important gaps and reason about when we will cross these based on gears-level reasoning or trend extrapolation/forecasting on “real-world” data (e.g. revenue?) rather than trend extrapolation on benchmarks. Example candidate “gaps” that I often hear for these sorts of cases are the lack of feedback loops and the “long-tail of tasks” / reliability.
A paradigm shift in AGI training is needed and benchmark trends don’t tell us much about when we will achieve this (this is basically Steven’s sibling comment): in this case the best analysis might involve looking at the base rate of paradigm shifts per research effort, and/or looking at specific possible shifts.
^ this taxonomy is not comprehensive, just things I came up with quickly. I might be missing a category that would be a good addition.
To give a cop-out answer to your question: I feel like if I were making a long-timelines argument, I’d argue that all 3 of those are forecasting approaches worth giving weight to and then aggregating. If I had to choose just one I’d probably still go with (1) though.
edit: oh there’s also the “defer to AI experts” argument. I mostly try not to think about deference-based arguments because thinking on the object-level is more productive, though I think if I were really trying to make an all-things-considered timelines distribution there’s some chance I would adjust to longer due to deference arguments (but also some chance I’d adjust toward shorter, given that lots of people who have thought deeply about AGI / are close to the action have short timelines).
There’s also “base rate of super crazy things happening is low” style arguments which I don’t give much weight to.
Thanks. I think this argument assumes that the main bottleneck to AI progress is something like research engineering speed, such that accelerating research engineering speed would drastically increase AI progress?
I think that makes sense as long as we are talking about domains like games / math / programming where you can automatically verify the results, but that something like the speed of real-world interaction becomes the bottleneck once we shift to more open domains.
Consider an AI being trained on a task such as “acting as the CEO for a startup”. There may not be a way to do this training other than to have it actually run a real startup, and then wait for several years to see how the results turn out. Even after several years, it will be hard to say exactly which parts of the decision process contributed, and how much of the startup’s success or failure was due to random factors. Furthermore, during this process the AI will need to be closely monitored in order to make sure that it does not do anything illegal or grossly immoral, slowing down its decision process and thus the whole training. And I haven’t even mentioned the expense of a training run where running just a single trial requires a startup-level investment (assuming that the startup won’t pay back its investment, of course).
Of course, humans do not learn to be CEOs by running a million companies and then getting a reward signal at the end. Human CEOs come in with a number of skills that they have already learned from somewhere else that they then apply to the context of running a company, shifting between their existing skills and applying them as needed. However, the question of what kind of approach and skill to apply in what situation, and how to prioritize between different approaches, is by itself a skillset that needs to be learned… quite possibly through a lot of real-world feedback.