Suppose we get an AI system which can (at least) automate away the vast majority of the job of a research engineer at an AI company (e.g. OpenAI). Let’s say this results in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you’d get as if the human employees operated 10x faster (but couldn’t use advanced AI in their work). This corresponds to “AIs that 10x AI R&D labor” as defined more precisely in this post. And, let’s say that this level of speed up is rolled out and exists (on average) in an AI company within 2 years (by Jan 2027). (I think this is about 20% likely, and would be about 25% likely if we allowed for some human adoption time.)
My current sense based on the post is that this wouldn’t substantially update you about the possibility of AGI (as you define it) by 2030. This sense is based on what you describe as the key indicators and your claim about a need for breakthroughs. Is this right?
I think the 10x AI R&D labor milestone is reasonably likely to be quickly reachable just by scaling up existing approaches. Full automation would probably require additional, qualitatively different components, but it might be reached quite quickly if AI algorithmic progress is substantially accelerated, and it isn't clear that this would look like much more of a breakthrough than "we can put LLMs inside an agent loop" is a breakthrough.
I’m very skeptical of AI being on the brink of dramatically accelerating AI R&D.
My current model is that ML experiments are bottlenecked not on software-engineer hours, but on compute. See Ilya Sutskever’s claim here:
95% of progress comes from the ability to run big experiments quickly. The utility of running many experiments is much less useful.
What actually matters for ML-style progress is picking the correct trick, and then applying it to a big-enough model. If you pick the trick wrong, you ruin the training run, which (a) potentially costs millions of dollars, (b) wastes the ocean of FLOP you could’ve used for something else.
And picking the correct trick is primarily a matter of research taste, because:
Tricks that work on smaller scales often don’t generalize to larger scales.
Tricks that work on larger scales often don’t work on smaller scales (due to bigger ML models having various novel emergent properties).
Simultaneously integrating several disjunctive incremental improvements into one SotA training run is likely nontrivial/impossible in the general case.[1]
So 10x’ing the number of small-scale experiments is unlikely to actually 10x ML research, along any promising research direction.
And, on top of that, I expect that AGI labs don’t actually have the spare compute to do that 10x’ing. I expect it’s all already occupied 24⁄7 running all manner of smaller-scale experiments, squeezing whatever value out of them that can be squeezed out. (See, e.g., the Superalignment team’s struggle to get access to compute: that suggests there isn’t an internal compute overhang.)
Indeed, an additional disadvantage of AI-based researchers/engineers is that their forward passes would cut into that limited compute budget. Offloading the computations associated with software engineering and experiment oversight onto the brains of mid-level human engineers is potentially more cost-efficient.
As a separate line of argumentation: Suppose that, as you describe it in another comment, we imagine that AI would soon be able to give senior researchers teams of 10x-speed 24/7-working junior devs, to whom they’d be able to delegate setting up and managing experiments. Is there a reason to think that any need for that couldn’t already be satisfied?
If it were an actual bottleneck, I would expect it to have already been solved: by the AGI labs just hiring tons of competent-ish software engineers. They have vast amounts of money now, and LLM-based coding tools seem competent enough to significantly speed up a human programmer’s work on formulaic tasks. So any sufficiently simple software-engineering task should already be done at lightning speeds within AGI labs.
In addition: the academic-research and open-source communities exist, and plausibly also fill the niche of “a vast body of competent-ish junior researchers trying out diverse experiments”. The task of keeping senior researchers up-to-date on openly published insights should likewise already be possible to dramatically speed up by tasking LLMs with summarizing them, or by hiring intermediary ML researchers to do that.
So I expect the market for mid-level software engineers/ML researchers to be saturated.
So, summing up:
10x’ing the ability to run small-scale experiments seems low-value, because:
The performance of a trick at a small scale says little (one way or another) about its performance on a bigger scale.
Integrating a scalable trick into the SotA-model tech stack is highly nontrivial.
Most of the value and insight comes from full-scale experiments, which are bottlenecked on compute and senior-researcher taste.
AI likely can’t even 10x small-scale experimentation, because that’s also already bottlenecked on compute, not on mid-level engineer-hours. There’s no “compute overhang”; all available compute is already in use 24⁄7.
If it weren’t the case, there’s nothing stopping AGI labs from hiring mid-level engineers until they are no longer bottlenecked on their time; or tapping academic research/open-source results.
AI-based engineers would plausibly be less efficient than human engineers, because their inference calls would cut into the compute that could instead be spent on experiments.
If so, then AI R&D is bottlenecked on research taste, system-design taste, and compute, and there’s relatively little non-AGI-level models can contribute to it. Maybe a 2x speed-up, at most, somehow; not a 10x’ing.
(@Nathan Helm-Burger, I recall you’re also bullish on AI speeding up AI R&D. Any counterarguments to the above?)
See the argument linked in the original post, that training SotA models is an incredibly difficult infrastructural problem that requires reasoning through the entire software-hardware stack. If you find a promising trick A that incrementally improves performance in some small setup, and you think it’d naively scale to a bigger setup, you also need to ensure it plays nice with tricks B, C, D.
For example, suppose that using A requires doing some operation on a hidden state that requires that state to be in a specific representation, but there’s a trick B which exploits a specific hardware property to dramatically speed up backprop by always keeping hidden states in a different representation. Then you need to either throw A or B out, or do something non-trivially clever to make them work together.
And then it’s a thousand little things like this; a vast Spaghetti Tower such that you can’t improve on a small-seeming part of it without throwing a dozen things in other places in disarray. (I’m reminded of the situation in the semiconductor industry here.)
In which case, finding a scalable insight isn’t enough: even integrating this insight requires full end-to-end knowledge of the tech stack and sophisticated research taste; something only senior researchers have.
I think you are somewhat overly fixated on my claim that “maybe the AIs will accelerate the labor input to AI R&D by 10x via basically just being fast and cheap junior employees”. My original claim (in the subcomment) is “I think it could suffice to do a bunch of relatively more banal things extremely fast and cheap”. The “could” part is important. Correspondingly, I think this is only part of the possibilities, though I do think this is a pretty plausible route. Additionally, banal does not imply simple/easy, and some level of labor quality will be needed.
(I did propose junior employees as an analogy which maybe implied simple/easy. I didn’t really intend this implication. I think the AIs have to be able to do at least somewhat hard tasks, but maybe don’t need to have a ton of context or have much taste if they can compensate with other advantages.)
I’ll argue against your comment, but first, I’d like to lay out a bunch of background to make sure we’re on the same page and to give a better understanding to people reading through.
Frontier LLM progress has historically been driven by 3 factors:
Increased spending on training runs ($)
Hardware progress (compute / $)
Algorithmic progress (intelligence / compute)
(The split seems to be very roughly 2⁄5, 1⁄5, 2⁄5 respectively.)
If we zoom into algorithmic progress, there are two relevant inputs to the production function:
Compute (for experiments)
Labor (from human researchers and engineers)
A reasonably common view is that compute is a very key bottleneck such that even if you greatly improved labor, algorithmic progress wouldn’t go much faster. This seems plausible to me (though somewhat unlikely), but this isn’t what I was arguing about. I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you’d get as if the human employees operated 10x faster. (In other words, 10x’ing this labor input.)
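(To make the compute-vs-labor framing concrete, here is a toy CES production-function sketch. The functional form and the parameter values are purely illustrative assumptions of mine, not anything claimed in this discussion; the point is just that the payoff from 10x’ing labor depends heavily on how substitutable labor and compute are.)

```python
# Toy CES production function for algorithmic progress:
#   progress = (a * compute**rho + (1 - a) * labor**rho) ** (1 / rho)
# rho close to 1   -> compute and labor are near-substitutes
# rho very negative -> they are strong complements (compute is a hard bottleneck)
# All parameter values are illustrative assumptions, not estimates.

def algo_progress(compute: float, labor: float, a: float = 0.5, rho: float = -2.0) -> float:
    return (a * compute**rho + (1 - a) * labor**rho) ** (1 / rho)

baseline = algo_progress(compute=1.0, labor=1.0)
ten_x_labor = algo_progress(compute=1.0, labor=10.0)
print(f"speedup from 10x labor alone: {ten_x_labor / baseline:.2f}x")
# rho = -2.0 (strong complements): ~1.41x, and capped near 1.41x even with infinite labor.
# rho = 0.5 (fairly substitutable): rerunning with rho=0.5 gives ~4.3x.
```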
Now, I’ll try to respond to your claims.
My current model is that ML experiments are bottlenecked not on software-engineer hours, but on compute.
Maybe, but that isn’t exactly a crux in this discussion as noted above. The relevant question is whether the important labor going into ML experiments is more “insights” or “engineering” (not whether both of these are bottlenecked on compute).
What actually matters for ML-style progress is picking the correct trick, and then applying it to a big-enough model.
My sense is that engineering is most of the labor, and most people I talk to with relevant experience have a view like: “taste is somewhat important, but lots of people have that and fast execution is roughly as important or more important”. Notably, AI companies really want to hire fast and good engineers and seem to care comparably about this as about more traditional research scientist jobs.
One relevant response would be “sure, AI companies want to hire good engineers, but weren’t we talking about the AIs being bad engineers who run fast?”
I think the AI engineers probably have to be quite good at moderate horizon software engineering, but also that scaling up current approaches can pretty likely achieve this. Possibly my “junior hire” analogy was problematic as “junior hire” can mean not as good at programming in addition to “not as much context at this company, but good at the general skills”.
So 10x’ing the number of small-scale experiments is unlikely to actually 10x ML research, along any promising research direction.
I wasn’t saying that these AIs would mostly be 10x’ing the number of small-scale experiments, though I do think that increasing the number and serial speed of experiments is an important part of the picture.
There are lots of other things that engineers do (e.g., increase the efficiency of experiments so they use less compute, make it much easier to run experiments, etc.).
Indeed, an additional disadvantage of AI-based researchers/engineers is that their forward passes would cut into that limited compute budget. Offloading the computations associated with software engineering and experiment oversight onto the brains of mid-level human engineers is potentially more cost-efficient.
Sure, but we have to be quantitative here. As a rough (and somewhat conservative) estimate, if I were to manage 50 copies of 3.5 Sonnet who are running 1⁄4 of the time (due to waiting for experiments, etc), that would cost roughly 50 copies * 70 tok / s * 1 / 4 uptime * 60 * 60 * 24 * 365 sec / year * (15 / 1,000,000) $ / tok = $400,000. This cost is comparable to salaries at current compute prices and probably much less than how much AI companies would be willing to pay for top employees. (And note this is after API markups etc. I’m not including input prices for simplicity, but input is much cheaper than output and it’s just a messy BOTEC anyway.)
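(Writing out that BOTEC, with every input taken from the comment above rather than from independent estimates:)

```python
# Inference-cost BOTEC, with all inputs taken from the comment above.
copies = 50                         # parallel instances of 3.5 Sonnet
tokens_per_sec = 70                 # output tokens per second per instance
uptime = 1 / 4                      # fraction of time generating (rest is waiting on experiments)
seconds_per_year = 60 * 60 * 24 * 365
dollars_per_token = 15 / 1_000_000  # $15 per million output tokens

annual_cost = copies * tokens_per_sec * uptime * seconds_per_year * dollars_per_token
print(f"${annual_cost:,.0f} per year")  # ~$413,910, i.e. roughly the $400,000 figure above
```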
Yes, this compute comes directly at the cost of experiments, but so do employee salaries at current margins. (Maybe this will be less true in the future.)
At the point when AIs are first capable of doing the relevant tasks, it seems likely it is pretty expensive, but I expect costs to drop pretty quickly. And, AI companies will have far more compute in the future as this increases at a rapid rate, making the plausible number of instances substantially higher.
Is there a reason to think that any need for that couldn’t already be satisfied? If it were an actual bottleneck, I would expect it to have already been solved: by the AGI labs just hiring tons of competent-ish software engineers.
I think AI companies would be very happy to hire lots of software engineers who work for nearly free, run 10x faster, work 24⁄7, and are pretty good research engineers. This seems especially true if you add other structural advantages of AI into the mix (train once and use many times, fewer personnel issues, easy to scale up and down, etc). The serial speed is very important.
(The bar of “competent-ish” seems too low. Again, I think “junior” might have been leading you astray here, sorry about that. Imagine more like median AI company engineering hire or a bit better than this. My original comment said “automating research engineering”.)
LLM-based coding tools seem competent enough to significantly speed up a human programmer’s work on formulaic tasks. So any sufficiently simple software-engineering task should already be done at lightning speeds within AGI labs.
I’m not sure I buy this claim about current tools. Also, I wasn’t making a claim about AIs just doing simple tasks (banal does not mean simple) as discussed earlier.
Stepping back from engineering vs insights, my sense is that it isn’t clear that the AIs will be terrible at insights or broader context. So, I think it will probably be more like they are very fast engineers and ok at experimental direction. Being ok helps a bunch by avoiding the need for human intervention at many points.
Maybe a relevant crux is: “Could scaling up current methods yield AIs that can mostly autonomously automate software engineering tasks that are currently being done by engineers at AI companies?” (More precisely, succeed at these tasks very reliably with only a small amount of human advice/help amortized over all tasks. Probably this would partially work by having humans or AIs decompose into relatively smaller subtasks that require a bit less context, though this isn’t notably different from how humans do things themselves.)
But, I think you maybe also have a further crux like: “Does making software engineering at AI companies cheap and extremely fast greatly accelerate the labor input to AI R&D?”
I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you’d get as if the human employees operated 10x faster
You’re right, that’s a meaningfully different claim and I should’ve noticed the difference.
I think I would disagree with it as well. Suppose we break up this labor into, say,
“Banal” software engineering.
Medium-difficult systems design and algorithmic improvements (finding optimizations, etc.).
Coming up with new ideas regarding how AI capabilities can be progressed.
High-level decisions regarding architectures, research avenues and strategies, etc. (Not just inventing transformers/the scaling hypothesis/the idea of RL-on-CoT, but picking those approaches out of a sea of ideas, and making the correct decision to commit hard to them.)
In turn, the factors relevant to (4) are:
(a) The serial thinking of the senior researchers and the communication/exchange of ideas between them.
(Where “the senior researchers” are defined as “the people with the power to make strategic research decisions at a given company”.)
(b) The outputs of significant experiments decided on by the senior researchers.
(c) The pool of untested-at-large-scale ideas presented to the senior researchers.
Importantly, in this model, speeding up (1), (2), (3) can only speed up (4) by increasing the turnover speed of (b) and the quality of (c). And I expect that non-AGI-complete AI cannot improve the quality of ideas (3) and cannot directly speed up/replace (a)[1], meaning any acceleration from it can only come from accelerating the engineering and the optimization of significant experiments.
Which, I expect, are in fact mostly bottlenecked by compute, and 10x’ing the human-labor productivity there doesn’t 10x the overall productivity of the human-labor input; it remains stubbornly held up by (a). (I do buy that it can significantly speed it up, say 2x it. But not 10x it.)
Separately, I’m also skeptical that near-term AI can speed up the nontrivial engineering involved in medium-difficult systems design and the management of significant experiments:
Stepping back from engineering vs insights, my sense is that it isn’t clear that the AIs will be terrible at insights or broader context. So, I think it will probably be more like they are very fast engineers and ok at experimental direction. Being ok helps a bunch by avoiding the need for human intervention at many points.
It seems to me that AIs have remained stubbornly terrible at this from GPT-3 to GPT-4 to Sonnet 3.5.1 to o1[2]; that the improvement on this hard-to-specify quality has been ~0. I guess we’ll see whether o3 (or an o-series model based on the next-generation base model) changes that. AI does feel right on the cusp of getting good at this...
… just as it felt at the time of GPT-3.5, and GPT-4, and Sonnet 3.5.1, and o1. That just the slightest improvement along this axis would allow us to plug the outputs of AI cognition into its inputs and get a competent, autonomous AI agent.
And yet here we are, still.
It’s puzzling to me and I don’t quite understand why it wouldn’t work, but based on the previous track record, I do in fact expect it not to work.
In other words: If an AI is able to improve the quality of ideas and/or reliably pluck out the best ideas from a sea of them, I expect that’s AGI and we can throw out all human cognitive labor entirely.
It seems to me that AIs have remained stubbornly terrible at this from GPT-3 to GPT-4 to Sonnet 3.5.1 to o1[2]; that the improvement on this hard-to-specify quality has been ~0.
Huh, I disagree reasonably strongly with this. Possible that something along these lines is an empirically testable crux.
FWIW my vibe is closer to Thane’s. Yesterday I commented that this discussion has been raising some topics that seem worthy of a systematic writeup as fodder for further discussion. I think here we’ve hit on another such topic: enumerating important dimensions of AI capability – such as generation of deep insights, or taking broader context into account – and then kicking off a discussion of the past trajectory / expected future progress on each dimension.
Some benchmarks got saturated across this range, so we can imagine “anti-saturated” benchmarks that didn’t yet noticeably move from zero, operationalizing intuitions of lack of progress. Performance on such benchmarks still has room to change significantly even with pretraining scaling in the near future, from 1e26 FLOPs of currently deployed models to 5e28 FLOPs by 2028, 500x more.
Sure, but we have to be quantitative here. As a rough (and somewhat conservative) estimate, if I were to manage 50 copies of 3.5 Sonnet who are running 1⁄4 of the time (due to waiting for experiments, etc), that would cost roughly 50 copies * 70 tok / s * 1 / 4 uptime * 60 * 60 * 24 * 365 sec / year * (15 / 1,000,000) $ / tok = $400,000. This cost is comparable to salaries at current compute prices and probably much less than how much AI companies would be willing to pay for top employees. (And note this is after API markups etc. I’m not including input prices for simplicity, but input is much cheaper than output and it’s just a messy BOTEC anyway.)
If you were to spend equal amounts of money on LLM inference and GPUs, that would mean that you’re spending $400,000 / year on GPUs. Divide that 50 ways and each Sonnet instance gets an $8,000 / year compute budget. Over the 18 hours per day that Sonnet is waiting for experiments, that is an average of $1.22 / hour, which is almost exactly the hourly cost of renting a single H100 on Vast.
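(The same arithmetic, spelled out with the figures above:)

```python
# Per-instance experiment-compute budget implied by the figures above.
gpu_spend_per_year = 400_000      # $/year on GPUs, matching the inference spend
instances = 50
budget_per_instance = gpu_spend_per_year / instances   # $8,000 per instance per year
waiting_hours_per_year = 18 * 365                      # 18 h/day spent waiting on experiments
hourly = budget_per_instance / waiting_hours_per_year
print(f"${hourly:.2f} per waiting hour")  # ~$1.22, about the rental cost of a single H100
```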
So I guess the crux is “would a swarm of unreliable researchers with one good GPU apiece be more effective at AI research than a few top researchers who can monopolize X0,000 GPUs for months, per unit of GPU time spent”.
(and yes, at some point the question switches to “would an AI researcher that is better at AI research than the best humans make better use of GPUs than the best humans” but at that point it’s a matter of quality, not quantity)
Sure, but I think that at the relevant point, you’ll probably be spending at least 5x more on experiments than on inference, and potentially a much larger ratio if heavy test-time compute usage isn’t important. I was just trying to argue that the naive inference cost isn’t that crazy.
Notably, if you give each researcher 2k GPUs running year-round, that would be $2 / GPU-hour * 2,000 GPUs * 24 * 365 hours = $35,040,000 per year, which is much higher than the inference cost of the models!
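(Again just restating the arithmetic in the comment, under its stated assumption of 2,000 GPUs per researcher at $2/GPU-hour:)

```python
# Experiment-compute cost per researcher, using the figures in the comment above.
gpus_per_researcher = 2_000
dollars_per_gpu_hour = 2
hours_per_year = 24 * 365

annual_gpu_cost = gpus_per_researcher * dollars_per_gpu_hour * hours_per_year
print(f"${annual_gpu_cost:,} per year")  # $35,040,000; dwarfs the ~$400,000 inference BOTEC
```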
Tricks that work on smaller scales often don’t generalize to larger scales.
Tricks that work on larger scales often don’t work on smaller scales (due to bigger ML models having various novel emergent properties).
My understanding is that these two claims are mostly false in practice. In particular, there have been a few studies (like e.g. this) which try to run yesterday’s algorithms with today’s scale, and today’s algorithms with yesterday’s scale, in order to attribute progress to scale vs algorithmic improvements. I haven’t gone through those studies in very careful detail, but my understanding is that they pretty consistently find today’s algorithms outperform yesterday’s algorithms even when scaled down, and yesterday’s algorithms underperform today’s even when scaled up. So unless I’ve badly misunderstood those studies, the mental model in which different tricks work best on different scales is basically just false, at least at the range of different scales the field has gone through in the past ~decade.
That said, there are cases where I could imagine Ilya’s claim making sense, e.g. if the “experiments” he’s talking about are experiments in using the net rather than training the net. Certainly one can do qualitatively different things with GPT4 than GPT2, so if one is testing e.g. a scaffolding setup or a net’s ability to play a particular game, then one needs to use the larger net. Perhaps that’s what Ilya had in mind?
I could imagine Ilya’s claim making sense, e.g. if the “experiments” he’s talking about are experiments in using the net rather than training the net
What I had in mind is something along these lines. More capable models[1] have various emergent properties. Specific tricks can rely on those properties being present, and work better or worse depending on that.
For example, the o-series training loop probably can’t actually “get off the ground” if the base model is only as smart as GPT-2: the model would ~never find its way to correct answers, so it’d never get reinforcement signals. You can still force it to work by sampling a billion guesses or by starting it with very easy problems (e.g., basic arithmetic?), but it’d probably deliver much less impressive results than applying it to GPT-4.
Scaling further down: I don’t recall if GPT-2 can make productive use of CoTs, but presumably e.g. GPT-1 can’t. At that point, the whole “do RL on CoTs” completely ceases to be a meaningful thing to try.
Generalizing: At a lower level of capabilities, there’s presumably a ton of various tricks that deliver a small bump to performance. Some of those tricks would have an effect size comparable to RL-on-CoTs-if-applied-at-this-scale. But out of a sea of those tricks, only a few of them would be such that their effectiveness rises dramatically with scale.
So, a more refined way to make my points would be:
If a trick shows promise at a small capability level, e. g. improving performance 10%, it doesn’t mean it’d show a similar 10%-improvement if applied at a higher capability level.
(Say, because it addresses a deficiency that a big-enough model just doesn’t have/which a big-enough pretraining run solves by default.)
If a trick shows marginal/no improvement at a small capability level, that doesn’t mean it won’t show a dramatic improvement at a higher capability level.
a few studies (like e.g. this) which try to run yesterday’s algorithms with today’s scale, and today’s algorithms with yesterday’s scale
My guess, based on the above, would be that even if today’s algorithms perform better than yesterday’s algorithms at smaller scales, the difference between their small-scale capabilities is less than the difference between yesterday’s algorithms and today’s algorithms at bigger scales. I.e., some algorithms make nonlinearly better use of compute, such that figuring out which tricks are the best is easier at larger scales. (Telling apart a 5% capability improvement from an 80% one.)
Whether they’re more capable by dint of being bigger (GPT-4), or being trained on better data (Sonnet 3.5.1), or having a better training loop + architecture (DeepSeek V3), etc.
Thanks for the mention, Thane. I think you make excellent points, and agree with all of them, to some degree. Yet, I’m expecting huge progress in AI algorithms to be unlocked by AI researchers.
How closely are they adhering to the “main path” of scaling existing techniques with minor tweaks? If you want to know how a minor tweak affects your current large model at scale, that is a very compute-heavy researcher-time-light type of experiment. On the other hand, if you want to test a lot of novel new paths at much smaller scales, then you are in a relatively compute-light but researcher-time-heavy regime.
What fraction of the available compute resources is the company assigning to each of training/inference/experiments? My guess is that the current split is somewhere around 63/33/4. If this were true, and the company decided to pivot away from training to focus on experiments (0/33/67), this would be something like a 16x increase in compute for experiments. So maybe that changes the bottleneck?
I think that Ilya and the AGI labs are part of a school of thought that is very focused on tweaking the existing architecture slightly. This then is a researcher-time-light and compute-heavy paradigm.
I think the big advancements require going further afield, outside the current search-space of the major players.
Which is not to say that I think LLMs have to be thrown out as useless. I expect some kind of combo system to work. The question is, combined with what?
Well, my prejudice as someone from a neuroscience background is that I think there are untapped insights from studying the brain.
Look at the limitations of current AI that François Chollet discusses in his various interviews and lectures. I think he’s pointing at real flaws. Look how many data points it takes a typical ML model to learn a new task! How limited in-context learning is!
Brains are clearly doing something different. I think our current models are much more powerful than a mouse brain, and yet there are some things that mice learn better.
So, if you stopped spending your compute on big expensive experiments, and instead spent it on combing through the neuroscience literature looking for clues… Would the AI researchers make a breakthrough? My guess is yes.
I also suspect that there are ideas in computer science, paths not yet explored with modern compute, that are hiding revolutionary insights. But to find them you’d need to go way outside the current paradigm. Set deep learning entirely aside and look at fundamental ideas. I doubt that this describes even 1% of the time currently being spent by researchers at the big companies. Their path seems to be working, so why should they look elsewhere? The cost to them personally of reorienting to entirely different fields of research would be huge. Not so for AI researchers. They can search everything, and quickly.
I think the big advancements require going further afield, outside the current search-space of the major players.
Oh, I very much agree. But any associated software engineering and experiments would then be nontrivial, ones involving setting up a new architecture, correctly interpreting when it’s not working due to a bug vs. because it’s fundamentally flawed, figuring out which tweaks are okay to make and which tweaks would defeat the point of the experiment, et cetera. Something requiring sophisticated research taste; not something you can trivially delegate-and-forget to a junior researcher (as per @ryan_greenblatt’s vision). (And importantly, if this can be delegated to (AI models isomorphic to) juniors, this is something AGI labs can already do just by hiring juniors.)
Same regarding looking for clues in neuroscience/computer-science literature. In order to pick out good ideas, you need great research taste and plausibly a bird’s eye view on the entire hardware-software research stack. I wouldn’t trust a median ML researcher/engineer’s summary; I would expect them to miss great ideas while bringing slop to my attention, such that it’d be more time-efficient to skim over the literature myself.
In addition, this is likely also where “95% of progress comes from the ability to run big experiments” comes into play. Tons of novel tricks/architectures would perform well at a small scale and flounder at a big scale, or vice versa. You need to pick a new approach and go hard on trying to make it work, not just lazily throw an experiment at it. Which is something that’s bottlenecked on the attention of a senior researcher, not a junior worker.
Overall, it sounds as if… you expect dramatically faster capabilities progress from the AGI labs pivoting towards exploring a breadth of new research directions, with the whole “AI researchers” thing being an unrelated feature? (They can do this pivot with or without them. And as per the compute-constraints arguments, borderline-competent AI researchers aren’t going to nontrivially improve on the companies’ ability to execute this pivot.)
So, I’ve been focusing on giving more of a generic view in my comments. Something that I think someone with a similar background in neuroscience and a similar background in ML would endorse as roughly plausible.
I also have an inside view which says more specific things. Like, I don’t just vaguely think that there are probably some fruitful directions in neglected parts of computer science history and in recent neuroscience. What I actually have are specific hypotheses that I’ve been working hard on trying to code up experiments for.
If someone gave me engineering support and compute sufficient to actually get my currently planned experiments run, and the results looked like dead-ends, I think my timelines would go from 2-3 years out to 5-10 years. I’d also be much less confident that we’d see rapid efficiency and capability gains from algorithmic research post-AGI, because I’d be more in the mindset of minor tweaks to existing paradigms and further expensive scaling.
This is why I’m basically thinking that I mostly agree with you, Thane, except for this inside view I have about specific approaches I think are currently neglected but unlikely to stay neglected.
Yeah, pretty much. Although I don’t expect this with super high confidence. Maybe 75%?
This is part of why I think a “pause” focused on large models / large training runs would actually dangerously accelerate progress towards AGI. I think a lot of well-resourced high-skill researchers would suddenly shift their focus onto breadth of exploration.
Another point:
I don’t think we’ll ever see AI agents that are exactly isomorphic to junior researchers. Why? Because of the weird spikiness of skills we see. In some ways the LLMs we have are much more skillful than junior researchers, in other ways they are pathetically bad. If you held their competencies constant except for improving the places where they are really bad, you’d suddenly have assistants much better than the median junior!
So when considering the details of how to apply the AI assistants we’re likely to get (based on extrapolating current spiky skill patterns), the set of affordances this offers to the top researchers is quite different from what having a bunch of juniors would be. I think this means we should expect things to be weirder and less smooth than Ryan’s straightforward speed-up prediction.
If you look at the recent AI scientist work that’s been done, you find this weird spiky portfolio. Having LLMs look through a bunch of papers and try to come up with new research directions? Mostly, but not entirely, crap… But then since it’s relatively cheap to do, and quick to do, and not too costly to filter, the trade-off ends up seeming worthwhile?
As for new experiments in totally new regimes, yeah. That’s harder for current LLMs to help with than the well-trodden zones. But I think the specific skills currently beginning to be unlocked by the o1/o3 direction may be enough to make coding agents reliable enough to do a much larger share of this novel experiment setup.
So… It’s complicated. Can’t be sure of success. Can’t be sure of a wall.
I see a bunch of good questions explicitly or implicitly posed here. I’ll touch on each one.
1. What level of capabilities would be needed to achieve “AIs that 10x AI R&D labor”? My guess is, pretty high. Obviously you’d need to be able to automate at least 90% of what capabilities researchers do today. But 90% is a lot; you’ll be pushing out into the long tail of tasks that require taste, subtle tacit knowledge, etc. I am handicapped here by having absolutely no experience with / exposure to what goes on inside an AI research lab. I have 35 years of experience as a software engineer but precisely zero experience working on AI. So on this question I somewhat defer to folks like you. But I do suspect there is a tendency to underestimate how difficult / annoying these tail effects will be; this is the same fundamental principle as Hofstadter’s Law, the Programmer’s Credo, etc.
I have a personal suspicion that a surprisingly large fraction of work (possibly but not necessarily limited to “knowledge work”) will turn out to be “AGI complete”, meaning that it will require something approaching full AGI to undertake it at human level. But I haven’t really developed this idea beyond an intuition. It’s a crux and I would like to find a way to develop / articulate it further.
2. What does it even mean to accelerate someone’s work by 10x? It may be that if your experts are no longer doing any grunt work, they are no longer getting the input they need to do the parts of their job that are hardest to automate and/or where they’re really adding magic-sauce value. Or there may be other sources of friction / loss. In some cases it may over time be possible to find adaptations, in other cases it may be a more fundamental issue. (A possible counterbalance: if AIs can become highly superhuman at some aspects of the job, not just in speed/cost but in quality of output, that could compensate for delivering a less-than-10x time speedup on the overall workflow.)
3. If AIs that 10x AI R&D labor are 20% likely to arrive and be adopted by Jan 2027, would that update my view on the possibility of AGI-as-I-defined-it by 2030? It would, because (per the above) I think that delivering that 10x productivity boost would require something pretty close to AGI. In other words, conditional on AI R&D labor being accelerated 10x by Jan 2027, I would expect that we have something close to AGI by Jan 2027, which also implies that we were able to make huge advances in capabilities in 24 months. Whereas I think your model is that we could get that level of productivity boost from something well short of AGI.
If it turns out that we get 10x AI R&D labor by Jan 2027 but the AIs that enabled this are pretty far from AGI… then my world model is very confused and I can’t predict how I would update, I’d need to know more about how that worked out. I suppose it would probably push me toward shorter timelines, because it would suggest that “almost all work is easy” and RSI starts to really kick in earlier than my expectation.
4. Is this 10x milestone achievable just by scaling up existing approaches? My intuition is no. I think that milestone requires very capable AI (items 1+2 in this list). And I don’t see current approaches delivering much progress on things I think will be needed for such capability, such as long-term memory, continuous learning, ability to “break out of the chatbox” and deal with open-ended information sources and extraneous information, or other factors that I mentioned in the original post.
I am very interested in discussing any or all of these questions further.
Obviously you’d need to be able to automate at least 90% of what capabilities researchers do today.
Actually, I don’t think so. AIs don’t just substitute for human researchers, they can specialize differently. Suppose (for simplicity) there are 2 roughly equally good lines of research that can substitute (e.g. they create some fungible algorithmic progress) and capability researchers currently do 50% of each. Further, suppose that AIs can 30x accelerate the first line of research, but are worthless for the second. This could yield >10x acceleration via researchers just focusing on the first line of research (depending on how diminishing returns go).
This doesn’t make a huge difference to my bottom line view, but it seems likely that this sort of change in specialization makes a 2x difference.
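(As a toy illustration of how much the “depending on how diminishing returns go” caveat matters: the little model below is entirely my own, with an assumed diminishing-returns exponent alpha, not anything from the comment.)

```python
# Toy model of specialization: two substitutable research lines, total progress
# f(labor_1) + f(labor_2) with diminishing returns f(x) = x**alpha.
# Baseline: humans split their labor 50/50. With AI: all human-directed effort
# moves to line 1, which AIs accelerate 30x. alpha is an assumed parameter.

def overall_speedup(alpha: float, accel: float = 30.0) -> float:
    baseline = 2 * 0.5**alpha            # both lines at half labor
    with_ai = (1.0 * accel) ** alpha     # all labor on line 1, accelerated 30x
    return with_ai / baseline

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {overall_speedup(alpha):.1f}x")
# alpha=0.5 (sharp diminishing returns): ~3.9x; alpha=0.9 (mild): ~19.9x
```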
But 90% is a lot, you’ll be pushing out into the long tail of tasks that require taste, subtle tacit knowledge, etc.
I think it could suffice to do a bunch of relatively more banal things extremely fast and cheap. In particular, it could suffice to do: software engineering, experiment babysitting, experiment debugging, optimization, improved experiment interpretation (e.g., trying to identify the important plots and considerations and presenting as concisely and effectively as possible), and generally checking experiment prior to launching them.
As an intuition pump, imagine you had nearly free junior hires who run 10x faster and also work all hours. Because they are free, you can run tons of copies. I think this could pretty plausibly speed things up by 10x.
I have a personal suspicion that a surprisingly large fraction of work (possibly but not necessarily limited to “knowledge work”) will turn out to be “AGI complete”, meaning that it will require something approaching full AGI to undertake it at human level.
I’m not sure if I exactly disagree, but I do think there is a ton of variation in the human range such that I dispute the way you seem to use “AGI complete”. I do think that the systems doing this acceleration will be quite general and capable and will be in some sense close to AGI. (Though less so if this occurs earlier like in my 20th percentile world.)
And I don’t see current approaches delivering much progress on things I think will be needed for such capability, such as long-term memory, continuous learning, ability to “break out of the chatbox” and deal with open-ended information sources and extraneous information, or other factors that I mentioned in the original post.
Suppose a company specifically trained an AI system to be very familiar with its code base and infrastructure and relatively good at doing experiments for it. Then, it seems plausible that (with some misc schlep) the only needed context would be project-specific context. It seems pretty plausible you can fit the context for tasks humans would do in a week into a 1 million token context window, especially with some tweaks and some forking/sub-agents. And automating 1 week seems like it could suffice for big acceleration depending on various details. (Concretely, code is roughly 10 tokens per line; we might expect the AI to write <20k lines including revisions, commands, etc., and to receive not much more than this amount of input. Books are maybe 150k tokens for reference, so the question is whether the AI needs over 6 books of context for 1 week of work. Currently, when AIs automate longer tasks they often do so via fewer steps than humans, spitting out the relevant outputs more directly, so I expect that the context needed for the AI is somewhat less.) Of course, it isn’t clear that models will be able to use their context window as well as humans use longer term memory.
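(The token arithmetic in that parenthetical, written out; the line counts and tokens-per-line are the rough figures from the comment:)

```python
# Rough token budget for one week of research-engineering work, figures from the comment.
tokens_per_line = 10
lines_written = 20_000                            # code, revisions, commands, etc. over the week
output_tokens = tokens_per_line * lines_written   # ~200k
input_tokens = output_tokens                      # "not much more than this amount of input"
tokens_per_book = 150_000

total_tokens = output_tokens + input_tokens
print(f"~{total_tokens:,} tokens, ~{total_tokens / tokens_per_book:.1f} books")
# ~400,000 tokens, ~2.7 books; comfortably under a 1M-token (~6.7 book) context window.
```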
As far as continuous learning, what if the AI company does online training of their AI systems based on all internal usage[1]? (Online training = just RL train on all internal usage based on human ratings or other sources of feedback.) Is the concern that this will be too sample inefficient (even with proliferation or other hacks)? (I don’t think it is obvious this goes either way, but a binary “no continuous learning method is known” doesn’t seem right to me.)
AIs don’t just substitute for human researchers, they can specialize differently. Suppose (for simplicity) there are 2 roughly equally good lines of research that can substitute (e.g. they create some fungible algorithmic progress) and capability researchers currently do 50% of each. Further, suppose that AIs can 30x accelerate the first line of research, but are worthless for the second. This could yield >10x acceleration via researchers just focusing on the first line of research (depending on how diminishing returns go).
Good point, this would have some impact.
As an intuition pump, imagine you had nearly free junior hires who run 10x faster, but also work all hours. Because they are free, you can run tons of copies. I think this could pretty plausibly speed things up by 10x.
Wouldn’t you drown in the overhead of generating tasks, evaluating the results, etc.? As a senior dev, I’ve had plenty of situations where junior devs were very helpful, but I’ve also had plenty of situations where it was more work for me to manage them than it would have been to do the job myself. These weren’t incompetent people, they just didn’t understand the situation well enough to make good choices and it wasn’t easy to impart that understanding. And I don’t think I’ve ever been sole tech lead for a team that was overall more than, say, 5x more productive than I am on my own – even when many of the people on the team were quite senior themselves. I can’t imagine trying to farm out enough work to achieve 10x of my personal productivity. There’s only so much you can delegate unless the system you’re delegating to has the sort of taste, judgement, and contextual awareness that a junior hire more or less by definition does not. Also you might run into the issue I mentioned where the senior person in the center of all this is no longer getting their hands dirty enough to collect the input needed to drive their high-level intuition and do their high-value senior things.
Hmm, I suppose it’s possible that AI R&D has a different flavor than what I’m used to. The software projects I’ve spent my career on are usually not very experimental in nature; the goal is generally not to learn whether an idea shows promise, it’s to design and implement code to implement a feature spec, for integration into the production system. If a junior dev does a so-so job, I have to work with them to bring it up to a higher standard, because we don’t want to incur the tech debt of integrating so-so code, we’d be paying for it for years. Maybe that plays out differently in AI R&D?
Incidentally, in this scenario, do you actually get to 10x the productivity of all your staff? Or do you just get to fire your junior staff? Seems like that depends on the distribution of staff levels today and on whether, in this world, junior staff can step up and productively manage AIs themselves.
Suppose a company specifically trained an AI system to be very familiar with its code base and infrastructure and relatively good at doing experiments for it. Then, it seems plausible that (with some misc schlep) the only needed context would be project specific context. …
These are fascinating questions but beyond what I think I can usefully contribute to in the format of a discussion thread. I might reach out at some point to see whether you’re open to discussing further. Ultimately I’m interested in developing a somewhat detailed model, with well-identified variables / assumptions that can be tested against reality.
Wouldn’t you drown in the overhead of generating tasks, evaluating the results, etc.? As a senior dev, I’ve had plenty of situations where junior devs were very helpful, but I’ve also had plenty of situations where it was more work for me to manage them than it would have been to do the job myself. These weren’t incompetent people, they just didn’t understand the situation well enough to make good choices and it wasn’t easy to impart that understanding. And I don’t think I’ve ever been sole tech lead for a team that was overall more than, say, 5x more productive than I am on my own – even when many of the people on the team were quite senior themselves. I can’t imagine trying to farm out enough work to achieve 10x of my personal productivity. There’s only so much you can delegate unless the system you’re delegating to has the sort of taste, judgement, and contextual awareness that a junior hire more or less by definition does not. Also you might run into the issue I mentioned where the senior person in the center of all this is no longer getting their hands dirty enough to collect the input needed to drive their high-level intuition and do their high-value senior things.
I’ve had a pretty similar experience personally but:
I think serial speed matters a lot and you’d be willing to go through a bunch more hassle if the junior devs worked 24⁄7 and at 10x speed.
Quantity can be a quality of its own: if you have truly vast (parallel) quantities of labor, you can be much more demanding and picky. (And make junior devs do much more work to understand what is going on.)
I do think the experimentation thing is probably somewhat big, but I’m uncertain.
(This one is breaking with the junior dev analogy, but whatever.) In the AI case, you can train/instruct once and then fork many times. In the analogy, this would be like you spending 1 month training the junior dev (who still works 24⁄7 and at 10x speed, so 10 months for them) and then forking them into many instances. Of course, perhaps AI sample efficiency is lower. However, my personal guess is that lots of compute spent on learning and aggressive schlep (e.g. proliferation, lots of self-supervised learning, etc) can plausibly substantially reduce or possibly eliminate the gap (at least once AIs are more capable) similar to how it works for EfficientZero.
Suppose we get an AI system which can (at least) automate away the vast majority of the job of a research engineer at an AI company (e.g. OpenAI). Let’s say this results in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you’d get as if the human employees operated 10x faster (but couldn’t use advanced AI in their work). This corresponds to “AIs that 10x AI R&D labor” as defined more precisely in this post. And, let’s say that this level of speed up is rolled out and exists (on average) in an AI company within 2 years (by Jan 2027). (I think this is about 20% likely, and would be about 25% likely if we allowed for some human adoption time.)
My current sense based on the post is that this wouldn’t substantially update you about the possibility of AGI (as you define it) by 2030. This sense is based on what you describe as the key indicators and your claim about a need for breakthroughs. Is this right?
I think the 10x AI R&D labor milestone is reasonably likely to be quickly reachable just by scaling up existing approaches. Full automation would probably require additional qualitatively different components, but this might be quite quickly reached if AI algorithmic progress is substantially accelerated and it isn’t clear this would look like much more of a breakthrough than “we can put LLMs inside an agent loop” is a breakthrough.
I’m very skeptical of AI being on the brink of dramatically accelerating AI R&D.
My current model is that ML experiments are bottlenecked not on software-engineer hours, but on compute. See Ilya Sutskever’s claim here:
What actually matters for ML-style progress is picking the correct trick, and then applying it to a big-enough model. If you pick the trick wrong, you ruin the training run, which (a) potentially costs millions of dollars, (b) wastes the ocean of FLOP you could’ve used for something else.
And picking the correct trick is primarily a matter of research taste, because:
Tricks that work on smaller scales often don’t generalize to larger scales.
Tricks that work on larger scales often don’t work on smaller scales (due to bigger ML models having various novel emergent properties).
Simultaneously integrating several disjunctive incremental improvements into one SotA training run is likely nontrivial/impossible in the general case.[1]
So 10x’ing the number of small-scale experiments is unlikely to actually 10x ML research, along any promising research direction.
And, on top of that, I expect that AGI labs don’t actually have the spare compute to do that 10x’ing. I expect it’s all already occupied 24⁄7 running all manners of smaller-scale experiments, squeezing whatever value out of them that can be squeezed out. (See e. g. Superalignment team’s struggle to get access to compute: that suggests there isn’t an internal compute overhang.)
Indeed, an additional disadvantage of AI-based researchers/engineers is that their forward passes would cut into that limited compute budget. Offloading the computations associated with software engineering and experiment oversight onto the brains of mid-level human engineers is potentially more cost-efficient.
As a separate line of argumentation: Suppose that, as you describe it in another comment, we imagine that AI would soon be able to give senior researchers teams of 10x-speed 24/7-working junior devs, to whom they’d be able to delegate setting up and managing experiments. Is there a reason to think that any need for that couldn’t already be satisfied?
If it were an actual bottleneck, I would expect it to have already been solved: by the AGI labs just hiring tons of competent-ish software engineers. They have vast amounts of money now, and LLM-based coding tools seem competent enough to significantly speed up a human programmer’s work on formulaic tasks. So any sufficiently simple software-engineering task should already be done at lightning speeds within AGI labs.
In addition: the academic-research and open-source communities exist, and plausibly also fill the niche of “a vast body of competent-ish junior researchers trying out diverse experiments”. The task of keeping senior researchers up-to-date on openly published insights should likewise already be possible to dramatically speed up by tasking LLMs with summarizing them, or by hiring intermediary ML researchers to do that.
So I expect the market for mid-level software engineers/ML researchers to be saturated.
So, summing up:
10x’ing the ability to run small-scale experiments seems low-value, because:
The performance of a trick at a small scale says little (one way or another) about its performance on a bigger scale.
Integrating a scalable trick into the SotA-model tech stack is highly nontrivial.
Most of the value and insight comes from full-scale experiments, which are bottlenecked on compute and senior-researcher taste.
AI likely can’t even 10x small-scale experimentation, because that’s also already bottlenecked on compute, not on mid-level engineer-hours. There’s no “compute overhang”; all available compute is already in use 24⁄7.
If it weren’t the case, there’s nothing stopping AGI labs from hiring mid-level engineers until they are no longer bottlenecked on their time; or tapping academic research/open-source results.
AI-based engineers would plausibly be less efficient than human engineers, because their inference calls would cut into the compute that could instead be spent on experiments.
If so, then AI R&D is bottlenecked on research taste, system-design taste, and compute, and there’s relatively little non-AGI-level models can contribute to it. Maybe a 2x speed-up, at most, somehow; not a 10x’ing.
(@Nathan Helm-Burger, I recall you’re also bullish on AI speeding up AI R&D. Any counterarguments to the above?)
See the argument linked in the original post, that training SotA models is an incredibly difficult infrastructural problem that requires reasoning through the entire software-hardware stack. If you find a promising trick A that incrementally improves performance in some small setup, and you think it’d naively scale to a bigger setup, you also need to ensure it plays nice with tricks B, C, D.
For example, suppose that using A requires doing some operation on a hidden state that requires that state to be in a specific representation, but there’s a trick B which exploits a specific hardware property to dramatically speed up backprop by always keeping hidden states in a different representation. Then you need to either throw A or B out, or do something non-trivially clever to make them work together.
And then it’s a thousand little things like this; a vast Spaghetti Tower such that you can’t improve on a small-seeming part of it without throwing a dozen things in other places in disarray. (I’m reminded of the situation in the semiconductor industry here.)
In which case, finding a scalable insight isn’t enough: even integrating this insight requires full end-to-end knowledge of the tech stack and sophisticated research taste; something only senior researchers have.
I think you are somewhat overly fixated on my claim that “maybe the AIs will accelerate the labor input R&D by 10x via basically just being fast and cheap junior employees”. My original claim (in the subcomment) is “I think it could suffice to do a bunch of relatively more banal things extremely fast and cheap”. The “could” part is important. Correspondingly, I think this is only part of the possibilities, though I do think this is a pretty plausible route. Additionally, banal does not imply simple/easy and some level of labor quality will be needed.
(I did propose junior employees as an analogy which maybe implied simple/easy. I didn’t really intend this implication. I think the AIs have to be able to do at least somewhat hard tasks, but maybe don’t need to have a ton of context or have much taste if they can compensate with other advantages.)
I’ll argue against your comment, but first, I’d like to lay out a bunch of background to make sure we’re on the same page and to give a better understanding to people reading through.
Frontier LLM progress has historically been driven by 3 factors:
Increased spending on training runs ($)
Hardware progress (compute / $)
Algorithmic progress (intelligence / compute)
(The split seems to be very roughly 2⁄5, 1⁄5, 2⁄5 respectively.)
If we zoom into algorithmic progress, there are two relevant inputs to the production function:
Compute (for experiments)
Labor (from human researchers and engineers)
A reasonably common view is that compute is a very key bottleneck such that even if you greatly improved labor, algorithmic progress wouldn’t go much faster. This seems plausible to me (though somewhat unlikely), but this isn’t what I was arguing about. I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you’d get as if the human employees operated 10x faster. (In other words, 10x’ing this labor input.)
Now, I’ll try to respond to your claims.
Maybe, but that isn’t exactly a crux in this discussion as noted above. The relevant question is whether the important labor going into ML experiments is more “insights” or “engineering” (not whether both of these are bottlenecked on compute).
My sense is that engineering is most of the labor, and most people I talk to with relevant experience have a view like: “taste is somewhat important, but lots of people have that and fast execution is roughly as important or more important”. Notably, AI companies really want to hire fast and good engineers and seem to care comparably about this as about more traditional research scientist jobs.
One relevant response would be “sure, AI companies want to hire good engineers, but weren’t we talking about the AIs being bad engineers who run fast?”
I think the AI engineers probably have to be quite good at moderate horizon software engineering, but also that scaling up current approaches can pretty likely achieve this. Possibly my “junior hire” analogy was problematic as “junior hire” can mean not as good at programming in addition to “not as much context at this company, but good at the general skills”.
I wasn’t saying that these AIs would mostly be 10x’ing the number of small-scale experiments, though I do think that increasing the number and serial speed of experiments is an important part of the picture.
There are lots of other things that engineers do (e.g., increase the efficiency of experiments so they use less compute, make it much easier to run experiments, etc.).
Sure, but we have to be quantitative here. As a rough (and somewhat conservative) estimate, if I were to manage 50 copies of 3.5 Sonnet who are running 1⁄4 of the time (due to waiting for experiments, etc), that would cost roughly 50 copies * 70 tok / s * 1 / 4 uptime * 60 * 60 * 24 * 365 sec / year * (15 / 1,000,000) $ / tok = $400,000. This cost is comparable to salaries at current compute prices and probably much less than how much AI companies would be willing to pay for top employees. (And note this is after API markups etc. I’m not including input prices for simplicity, but input is much cheaper than output and it’s just a messy BOTEC anyway.)
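(For anyone who wants to sanity-check that BOTEC, here’s a minimal sketch of the same arithmetic in Python; the copy count, token rate, uptime fraction, and per-token price are just the figures assumed in the paragraph above.)

```python
# Rough BOTEC for the inference cost of 50 always-available Sonnet-class agents.
# All inputs are the assumed figures from the comment above, not measured values.

copies = 50                  # parallel agent instances
tokens_per_second = 70       # assumed output rate per instance
uptime_fraction = 0.25       # fraction of time actually generating (rest waiting on experiments)
seconds_per_year = 60 * 60 * 24 * 365
price_per_output_token = 15 / 1_000_000   # $15 per million output tokens (API-style pricing)

tokens_per_year = copies * tokens_per_second * uptime_fraction * seconds_per_year
cost_per_year = tokens_per_year * price_per_output_token

print(f"{tokens_per_year:.2e} output tokens/year -> ${cost_per_year:,.0f}/year")
# ~2.76e10 output tokens/year -> ~$413,910/year, i.e. roughly the $400k quoted above.
```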
Yes, this compute comes directly at the cost of experiments, but so do employee salaries at current margins. (Maybe this will be less true in the future.)
At the point when AIs are first capable of doing the relevant tasks, it seems likely it is pretty expensive, but I expect costs to drop pretty quickly. And, AI companies will have far more compute in the future as this increases at a rapid rate, making the plausible number of instances substantially higher.
I think AI companies would be very happy to hire lots of software engineers who work for nearly free, run 10x faster, work 24⁄7, and are pretty good research engineers. This seems especially true if you add other structural advantages of AI into the mix (train once and use many times, fewer personnel issues, easy to scale up and down, etc). The serial speed is very important.
(The bar of “competent-ish” seems too low. Again, I think “junior” might have been leading you astray here, sorry about that. Imagine more like median AI company engineering hire or a bit better than this. My original comment said “automating research engineering”.)
I’m not sure I buy this claim about current tools. Also, I wasn’t making a claim about AIs just doing simple tasks (banal does not mean simple) as discussed earlier.
Stepping back from engineering vs insights, my sense is that it isn’t clear that the AIs will be terrible at insights or broader context. So, I think it will probably be more like they are very fast engineers and ok at experimental direction. Being ok helps a bunch by avoiding the need for human intervention at many points.
Maybe a relevant crux is: “Could scaling up current methods yield AIs that can mostly autonomously automate software engineering tasks that are currently being done by engineers at AI companies?” (More precisely, succeed at these tasks very reliably with only a small amount of human advice/help amortized over all tasks. Probably this would partially work by having humans or AIs decompose into relatively smaller subtasks that require a bit less context, though this isn’t notably different from how humans do things themselves.)
But, I think you maybe also have a further crux like: “Does making software engineering at AI companies cheap and extremely fast greatly accelerate the labor input to AI R&D?”
Yup, those two do seem to be the cruxes here.
You’re right, that’s a meaningfully different claim and I should’ve noticed the difference.
I think I would disagree with it as well. Suppose we break up this labor into, say,
“Banal” software engineering.
Medium-difficult systems design and algorithmic improvements (finding optimizations, etc.).
Coming up with new ideas regarding how AI capabilities can be progressed.
High-level decisions regarding architectures, research avenues and strategies, etc. (Not just inventing transformers/the scaling hypothesis/the idea of RL-on-CoT, but picking those approaches out of a sea of ideas, and making the correct decision to commit hard to them.)
In turn, the factors relevant to (4) are:
(a) The serial thinking of the senior researchers and the communication/exchange of ideas between them.
(Where “the senior researchers” are defined as “the people with the power to make strategic research decisions at a given company”.)
(b) The outputs of significant experiments decided on by the senior researchers.
(c) The pool of untested-at-large-scale ideas presented to the senior researchers.
Importantly, in this model, speeding up (1), (2), (3) can only speed up (4) by increasing the turnover speed of (b) and the quality of (c). And I expect that non-AGI-complete AI cannot improve the quality of ideas (3) and cannot directly speed up/replace (a)[1], meaning any acceleration from it can only come from accelerating the engineering and the optimization of significant experiments.
Which, I expect, are in fact mostly bottlenecked by compute, and 10x’ing the human-labor productivity there doesn’t 10x the overall productivity of the human-labor input; it remains stubbornly held up by (a). (I do buy that it can significantly speed it up, say 2x it. But not 10x it.)
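(One way to make that intuition concrete is an Amdahl’s-law-style toy calculation: if only the engineering share of the work gets 10x’d while the senior-researcher bottleneck (a) stays fixed, the overall speedup is capped. The work-share fractions below are purely illustrative assumptions, not figures anyone in this discussion has committed to.)

```python
# Amdahl's-law-style toy model: speed up only the fraction of the work that AI-assisted
# engineering can accelerate, and leave the senior-researcher bottleneck (a) untouched.
# The fractions below are illustrative assumptions, not measured values.

def overall_speedup(accelerated_fraction: float, speedup: float) -> float:
    """Overall speedup when only `accelerated_fraction` of the work is sped up by `speedup`."""
    return 1 / ((1 - accelerated_fraction) + accelerated_fraction / speedup)

for frac in (0.5, 0.7, 0.9):
    print(f"{frac:.0%} of work sped up 10x -> {overall_speedup(frac, 10):.1f}x overall")
# 50% -> 1.8x, 70% -> 2.7x, 90% -> 5.3x:
# even a large speedup on the engineering share leaves overall progress held up by (a).
```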
Separately, I’m also skeptical that near-term AI can speed up the nontrivial engineering involved in medium-difficult systems design and the management of significant experiments:
It seems to me that AIs have remained stubbornly terrible at this from GPT-3 to GPT-4 to Sonnet 3.5.1 to o1[2]; that the improvement on this hard-to-specify quality has been ~0. I guess we’ll see if o3 (or an o-series model based on the next-generation base model) changes that. AI does feel right on the cusp of getting good at this...
… just as it felt at the time of GPT-3.5, and GPT-4, and Sonnet 3.5.1, and o1. That just the slightest improvement along this axis would allow us to plug the outputs of AI cognition into its inputs and get a competent, autonomous AI agent.
And yet here we are, still.
It’s puzzling to me and I don’t quite understand why it wouldn’t work, but based on the previous track record, I do in fact expect it not to work.
In other words: If an AI is able to improve the quality of ideas and/or reliably pluck out the best ideas from a sea of them, I expect that’s AGI and we can throw out all human cognitive labor entirely.
Arguably, no improvement since GPT-2; I think that post aged really well.
Huh, I disagree reasonably strongly with this. Possible that something along these lines is an empirically testable crux.
FWIW my vibe is closer to Thane’s. Yesterday I commented that this discussion has been raising some topics that seem worthy of a systematic writeup as fodder for further discussion. I think here we’ve hit on another such topic: enumerating important dimensions of AI capability – such as generation of deep insights, or taking broader context into account – and then kicking off a discussion of the past trajectory / expected future progress on each dimension.
Some benchmarks got saturated across this range, so we can imagine “anti-saturated” benchmarks that didn’t yet noticeably move from zero, operationalizing intuitions of lack of progress. Performance on such benchmarks still has room to change significantly even with pretraining scaling in the near future, from 1e26 FLOPs of currently deployed models to 5e28 FLOPs by 2028, 500x more.
If you were to spend equal amounts of money on LLM inference and GPUs, that would mean that you’re spending $400,000 / year on GPUs. Divide that 50 ways and each Sonnet instance gets an $8,000 / year compute budget. Over the 18 hours per day that Sonnet is waiting for experiments, that is an average of $1.22 / hour, which is almost exactly the hourly cost of renting a single H100 on Vast.
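(A minimal sketch of that division, reusing the assumed figures from above: the $400k matched budget, the 50-way split, and the 18 idle hours per day.)

```python
# Per-instance compute budget if experiment-GPU spend is matched to inference spend.
# All numbers are the assumed figures from the surrounding comments.

gpu_budget_per_year = 400_000     # $/year on experiment GPUs (matched to inference spend)
instances = 50
idle_hours_per_day = 18           # hours/day each instance spends waiting on experiments

budget_per_instance = gpu_budget_per_year / instances         # $8,000/year each
hourly = budget_per_instance / (idle_hours_per_day * 365)     # spread over the idle hours

print(f"${budget_per_instance:,.0f}/year per instance -> ${hourly:.2f}/hour")
# $8,000/year -> ~$1.22/hour, about the rental price of a single H100.
```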
So I guess the crux is “would a swarm of unreliable researchers with one good GPU apiece be more effective at AI research than a few top researchers who can monopolize X0,000 GPUs for months, per unit of GPU time spent”.
(and yes, at some point the question switches to “would an AI researcher that is better at AI research than the best humans make better use of GPUs than the best humans” but at that point it’s a matter of quality, not quantity)
Sure, but I think that at the relevant point, you’ll probably be spending at least 5x more on experiments than on inference, and potentially a much larger ratio if heavy test-time compute usage isn’t important. I was just trying to argue that the naive inference cost isn’t that crazy.
Notably, if you give each researcher 2k GPUs, that would be $2 / GPU-hour * 2k GPUs * 24 * 365 hours = $35,040,000 per year, which is much higher than the inference cost of the models!
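(Same style of sketch for that figure; the $2/GPU-hour price and the 2k-GPU year-round allocation are the assumptions from the sentence above.)

```python
# Cost of a 2,000-GPU experiment allocation running year-round at an assumed $2/GPU-hour.
gpus = 2_000
price_per_gpu_hour = 2.0
hours_per_year = 24 * 365

experiment_cost = gpus * price_per_gpu_hour * hours_per_year
print(f"${experiment_cost:,.0f}/year")   # $35,040,000/year, dwarfing the ~$400k inference bill
```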
My understanding is that these two claims are mostly false in practice. In particular, there have been a few studies (like e.g. this) which try to run yesterday’s algorithms with today’s scale, and today’s algorithms with yesterday’s scale, in order to attribute progress to scale vs algorithmic improvements. I haven’t gone through those studies in very careful detail, but my understanding is that they pretty consistently find today’s algorithms outperform yesterday’s algorithms even when scaled down, and yesterday’s algorithms underperform today’s even when scaled up. So unless I’ve badly misunderstood those studies, the mental model in which different tricks work best on different scales is basically just false, at least at the range of different scales the field has gone through in the past ~decade.
That said, there are cases where I could imagine Ilya’s claim making sense, e.g. if the “experiments” he’s talking about are experiments in using the net rather than training the net. Certainly one can do qualitatively different things with GPT4 than GPT2, so if one is testing e.g. a scaffolding setup or a net’s ability to play a particular game, then one needs to use the larger net. Perhaps that’s what Ilya had in mind?
What I had in mind is something along these lines. More capable models[1] have various emergent properties. Specific tricks can rely on those properties being present, and work better or worse depending on that.
For example, the o-series training loop probably can’t actually “get off the ground” if the base model is only as smart as GPT-2: the model would ~never find its way to correct answers, so it’d never get reinforcement signals. You can still force it to work by sampling a billion guesses or by starting it with very easy problems (e. g., basic arithmetic?), but it’d probably deliver much less impressive results than applying it to GPT-4.
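(A toy calculation of the “never gets reinforcement signals” point: if the per-attempt success probability is tiny, essentially no rollouts ever get rewarded, so there’s nothing to reinforce. The probabilities below are made up for illustration, not estimates of any actual model.)

```python
# Probability of getting at least one correct rollout (and hence any reward signal)
# out of k sampled attempts, as a function of per-attempt success probability p.
# The values of p are purely illustrative.

def p_any_success(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

for p in (1e-9, 1e-4, 0.05):
    print(f"p = {p:g}: P(>=1 success in 1,000 attempts) = {p_any_success(p, 1_000):.2e}")
# p = 1e-9 (a base model hopelessly below the task, say): ~1e-6 -> effectively no signal.
# p = 0.05 (a base model that sometimes gets it right): ~1.0 -> plenty of signal to reinforce.
```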
Scaling further down: I don’t recall if GPT-2 can make productive use of CoTs, but presumably e. g. GPT-1 can’t. At that point, the whole “do RL on CoTs” completely ceases to be a meaningful thing to try.
Generalizing: At a lower level of capabilities, there’s presumably a ton of various tricks that deliver a small bump to performance. Some of those tricks would have an effect size comparable to RL-on-CoTs-if-applied-at-this-scale. But out of a sea of those tricks, only a few of them would be such that their effectiveness rises dramatically with scale.
So, a more refined way to make my points would be:
If a trick shows promise at a small capability level, e. g. improving performance 10%, it doesn’t mean it’d show a similar 10%-improvement if applied at a higher capability level.
(Say, because it addresses a deficiency that a big-enough model just doesn’t have/which a big-enough pretraining run solves by default.)
If a trick shows marginal/no improvement at a small capability level, that doesn’t mean it won’t show a dramatic improvement at a higher capability level.
My guess, based on the above, would be that even if today’s algorithms perform better than yesterday’s algorithms at smaller scales, the difference between their small-scale capabilities is less than the difference between yesterday’s algorithms and today’s algorithms at bigger scales. I. e.: some algorithms make nonlinearly better use of compute, such that figuring out which tricks are the best is easier at larger scales. (Telling apart a 5% capability improvement from an 80% one.)
Whether they’re more capable by dint of being bigger (GPT-4), or being trained on better data (Sonnet 3.5.1), or having a better training loop + architecture (DeepSeek V3), etc.
Thanks for the mention Thane. I think you make excellent points, and agree with all of them, to some degree. Yet, I’m expecting huge progress in AI algorithms to be unlocked by AI researchers.
I’ll quote from my comments on the other recent AI timeline discussion.
I think that Ilya and the AGI labs are part of a school of thought that is very focused on tweaking the existing architecture slightly. This then is a researcher-time-light and compute-heavy paradigm.
I think the big advancements require going further afield, outside the current search-space of the major players.
Which is not to say that I think LLMs have to be thrown out as useless. I expect some kind of combo system to work. The question is, combined with what?
Well, my prejudice as someone from a neuroscience background is that I think there are untapped insights from studying the brain.
Look at the limitations of current AI that François Chollet discusses in his various interviews and lectures. I think he’s pointing at real flaws. Look how many data points it takes a typical ML model to learn a new task! How limited in-context learning is!
Brains are clearly doing something different. I think our current models are much more powerful than a mouse brain, and yet there are some things that mice learn better.
So, if you stopped spending your compute on big expensive experiments, and instead spent it on combing through the neuroscience literature looking for clues… Would the AI researchers make a breakthrough? My guess is yes.
I also suspect that there are ideas in computer science, paths not yet explored with modern compute, that are hiding revolutionary insights. But to find them you’d need to go way outside the current paradigm. Set deep learning entirely aside and look at fundamental ideas. I doubt that this describes even 1% of the time currently being spent by researchers at the big companies. Their path seems to be working; why should they look elsewhere? The cost to them personally of reorienting to entirely different fields of research would be huge. Not so for AI researchers. They can search everything, and quickly.
Oh, I very much agree. But any associated software engineering and experiments would then be nontrivial, ones involving setting up a new architecture, correctly interpreting when it’s not working due to a bug vs. because it’s fundamentally flawed, figuring out which tweaks are okay to make and which tweaks would defeat the point of the experiment, et cetera. Something requiring sophisticated research taste; not something you can trivially delegate-and-forget to a junior researcher (as per @ryan_greenblatt’s vision). (And importantly, if this can be delegated to (AI models isomorphic to) juniors, this is something AGI labs can already do just by hiring juniors.)
Same regarding looking for clues in neuroscience/computer-science literature. In order to pick out good ideas, you need great research taste and plausibly a bird’s eye view on the entire hardware-software research stack. I wouldn’t trust a median ML researcher/engineer’s summary; I would expect them to miss great ideas while bringing slop to my attention, such that it’d be more time-efficient to skim over the literature myself.
In addition, this is likely also where “95% of progress comes from the ability to run big experiments” comes into play. Tons of novel tricks/architectures would perform well at a small scale and flounder at a big scale, or vice versa. You need to pick a new approach and go hard on trying to make it work, not just lazily throw an experiment at it. Which is something that’s bottlenecked on the attention of a senior researcher, not a junior worker.
Overall, it sounds as if… you expect dramatically faster capabilities progress from the AGI labs pivoting towards exploring a breadth of new research directions, with the whole “AI researchers” thing being an unrelated feature? (They can do this pivot with or without them. And as per the compute-constraints arguments, borderline-competent AI researchers aren’t going to nontrivially improve on the companies’ ability to execute this pivot.)
So, I’ve been focusing on giving more of a generic view in my comments. Something that I think someone with a similar background in neuroscience and ML would endorse as roughly plausible.
I also have an inside view which says more specific things. Like, I don’t just vaguely think that there are probably some fruitful directions in neglected parts of computer science history and in recent neuroscience. What I actually have are specific hypotheses that I’ve been working hard on trying to code up experiments for.
If someone gave me engineering support and compute sufficient to actually get my currently planned experiments run, and the results looked like dead-ends, I think my timelines would go from 2-3 years out to 5-10 years. I’d also be much less confident that we’d see rapid efficiency and capability gains from algorithmic research post-AGI, because I’d be more in mindset of minor tweaks to existing paradigms and further expensive scaling.
This is why I’m basically thinking that I mostly agree with you, Thane, except for this inside view I have about specific approaches I think are currently neglected but unlikely to stay neglected.
Yeah, pretty much. Although I don’t expect this with super high confidence. Maybe 75%?
This is part of why I think a “pause” focused on large models / large training runs would actually dangerously accelerate progress towards AGI. I think a lot of well-resourced high-skill researchers would suddenly shift their focus onto breadth of exploration.
Another point:
I don’t think we’ll ever see AI agents that are exactly isomorphic to junior researchers. Why? Because of the weird spikiness of skills we see. In some ways the LLMs we have are much more skillful than junior researchers, in other ways they are pathetically bad. If you held their competencies constant except for improving the places where they are really bad, you’d suddenly have assistants much better than the median junior!
So when considering the details of how to apply the AI assistants we’re likely to get (based on extrapolating current spiky skill patterns), the set of affordances this offers to the top researchers is quite different from what having a bunch of juniors would offer. I think this means we should expect things to be weirder and less smooth than Ryan’s straightforward speed-up prediction.
If you look at the recent AI scientist work that’s been done you find this weird spiky portfolio. Having LLMs look through a bunch of papers and try to come up with new research directions? Mostly, but not entirely crap… But then since it’s relatively cheap to do, and quick to do, and not too costly to filter, the trade-off ends up seeming worthwhile?
As for new experiments in totally new regimes, yeah. That’s harder for current LLMs to help with than the well-trodden zones. But I think the specific skills currently beginning to be unlocked by the o1/o3 direction may be enough to make coding agents reliable enough to do a much larger share of this novel experiment setup.
So… It’s complicated. Can’t be sure of success. Can’t be sure of a wall.
I see a bunch of good questions explicitly or implicitly posed here. I’ll touch on each one.
1. What level of capabilities would be needed to achieve “AIs that 10x AI R&D labor”? My guess is, pretty high. Obviously you’d need to be able to automate at least 90% of what capabilities researchers do today. But 90% is a lot; you’ll be pushing out into the long tail of tasks that require taste, subtle tacit knowledge, etc. I am handicapped here by having absolutely no experience with / exposure to what goes on inside an AI research lab. I have 35 years of experience as a software engineer but precisely zero experience working on AI. So on this question I somewhat defer to folks like you. But I do suspect there is a tendency to underestimate how difficult / annoying these tail effects will be; this is the same fundamental principle as Hofstadter’s Law, the Programmer’s Credo, etc.
I have a personal suspicion that a surprisingly large fraction of work (possibly but not necessarily limited to “knowledge work”) will turn out to be “AGI complete”, meaning that it will require something approaching full AGI to undertake it at human level. But I haven’t really developed this idea beyond an intuition. It’s a crux and I would like to find a way to develop / articulate it further.
2. What does it even mean to accelerate someone’s work by 10x? It may be that if your experts are no longer doing any grunt work, they are no longer getting the input they need to do the parts of their job that are hardest to automate and/or where they’re really adding magic-sauce value. Or there may be other sources of friction / loss. In some cases it may over time be possible to find adaptations, in other cases it may be a more fundamental issue. (A possible counterbalance: if AIs can become highly superhuman at some aspects of the job, not just in speed/cost but in quality of output, that could compensate for delivering a less-than-10x time speedup on the overall workflow.)
3. If AIs that 10x AI R&D labor are 20% likely to arrive and be adopted by Jan 2027, would that update my view on the possibility of AGI-as-I-defined-it by 2030? It would, because (per the above) I think that delivering that 10x productivity boost would require something pretty close to AGI. In other words, conditional on AI R&D labor being accelerated 10x by Jan 2027, I would expect that we have something close to AGI by Jan 2027, which also implies that we were able to make huge advances in capabilities in 24 months. Whereas I think your model is that we could get that level of productivity boost from something well short of AGI.
If it turns out that we get 10x AI R&D labor by Jan 2027 but the AIs that enabled this are pretty far from AGI… then my world model is very confused and I can’t predict how I would update, I’d need to know more about how that worked out. I suppose it would probably push me toward shorter timelines, because it would suggest that “almost all work is easy” and RSI starts to really kick in earlier than my expectation.
4. Is this 10x milestone achievable just by scaling up existing approaches? My intuition is no. I think that milestone requires very capable AI (items 1+2 in this list). And I don’t see current approaches delivering much progress on things I think will be needed for such capability, such as long-term memory, continuous learning, ability to “break out of the chatbox” and deal with open-ended information sources and extraneous information, or other factors that I mentioned in the original post.
I am very interested in discussing any or all of these questions further.
Actually, I don’t think so. AIs don’t just substitute for human researchers, they can specialize differently. Suppose (for simplicity) there are 2 roughly equally good lines of research that can substitute (e.g. they create some fungible algorithmic progress) and capability researchers currently do 50% of each. Further, suppose that AIs can 30x accelerate the first line of research, but are worthless for the second. This could yield >10x acceleration via researchers just focusing on the first line of research (depending on how diminishing returns go).
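(A toy version of that argument, assuming for simplicity that returns are linear, i.e. no diminishing returns; the 50/50 split, fungibility, and the 30x figure are the illustrative assumptions from the paragraph above.)

```python
# Toy model of specialization: two fungible lines of research, labor currently split 50/50.
# AI accelerates line 1 by 30x and is useless for line 2. Linear returns assumed for simplicity.

def total_output(frac_on_line1: float, ai_speedup: float) -> float:
    return frac_on_line1 * ai_speedup + (1 - frac_on_line1) * 1.0

baseline = total_output(0.5, ai_speedup=1.0)        # no AI: 0.5 + 0.5 = 1.0
reallocated = total_output(1.0, ai_speedup=30.0)    # everyone moves to the accelerated line

print(f"{reallocated / baseline:.0f}x")             # 30x under linear returns; diminishing
                                                    # returns pull this down, but it can stay >10x
```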
This doesn’t make a huge difference to my bottom line view, but it seems likely that this sort of change in specialization makes a 2x difference.
I think it could suffice to do a bunch of relatively more banal things extremely fast and cheap. In particular, it could suffice to do: software engineering, experiment babysitting, experiment debugging, optimization, improved experiment interpretation (e.g., trying to identify the important plots and considerations and presenting them as concisely and effectively as possible), and generally checking experiments prior to launching them.
As an intuition pump, imagine you had nearly free junior hires who run 10x faster and also work all hours. Because they are free, you can run tons of copies. I think this could pretty plausibly speed things up by 10x.
I’m not sure if I exactly disagree, but I do think there is a ton of variation in the human range such that I dispute the way you seem to use “AGI complete”. I do think that the systems doing this acceleration will be quite general and capable and will be in some sense close to AGI. (Though less so if this occurs earlier like in my 20th percentile world.)
Suppose a company specifically trained an AI system to be very familiar with its code base and infrastructure and relatively good at doing experiments for it. Then, it seems plausible that (with some misc schlep) the only needed context would be project-specific context. It seems pretty plausible you can fit the context for tasks humans would do in a week into a 1-million-token context window, especially with some tweaks and some forking/sub-agents. And automating 1 week seems like it could suffice for big acceleration depending on various details. (Concretely, code is roughly 10 tokens per line, we might expect the AI to write <20k lines including revisions, commands, etc., and to receive not much more than this amount of input. Books are maybe 150k tokens for reference, so the question is whether the AI needs over 6 books of context for 1 week of work. Currently, when AIs automate longer tasks they often do so via fewer steps than humans, spitting out the relevant outputs more directly, so I expect that the context needed for the AI is somewhat less.) Of course, it isn’t clear that models will be able to use their context window as well as humans use longer term memory.
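(A rough version of that token budget; the lines-of-code, tokens-per-line, and tokens-per-book figures are the assumptions from the parenthetical above.)

```python
# Does a week of research-engineering work fit in a ~1M-token context window?
# All figures are the rough assumptions from the comment above.

lines_of_code = 20_000        # code written over the week, including revisions and commands
tokens_per_line = 10
output_tokens = lines_of_code * tokens_per_line   # ~200k tokens of output
input_tokens = output_tokens                       # assume input is not much more than output
tokens_per_book = 150_000

total = output_tokens + input_tokens
print(f"~{total:,} tokens, i.e. ~{total / tokens_per_book:.1f} books vs a ~6.7-book (1M-token) window")
# ~400,000 tokens (~2.7 books), comfortably inside a 1M-token context on these assumptions.
```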
As far as continuous learning goes, what if the AI company does online training of their AI systems based on all internal usage[1]? (Online training = just RL train on all internal usage based on human ratings or other sources of feedback.) Is the concern that this will be too sample inefficient (even with proliferation or other hacks)? (I don’t think it is obvious this goes either way, but a binary “no continuous learning method is known” doesn’t seem right to me.)
Confidentiality concerns might prevent training on literally all internal usage.
Thanks for engaging so deeply on this!
Good point, this would have some impact.
Wouldn’t you drown in the overhead of generating tasks, evaluating the results, etc.? As a senior dev, I’ve had plenty of situations where junior devs were very helpful, but I’ve also had plenty of situations where it was more work for me to manage them than it would have been to do the job myself. These weren’t incompetent people, they just didn’t understand the situation well enough to make good choices and it wasn’t easy to impart that understanding. And I don’t think I’ve ever been sole tech lead for a team that was overall more than, say, 5x more productive than I am on my own – even when many of the people on the team were quite senior themselves. I can’t imagine trying to farm out enough work to achieve 10x of my personal productivity. There’s only so much you can delegate unless the system you’re delegating to has the sort of taste, judgement, and contextual awareness that a junior hire more or less by definition does not. Also you might run into the issue I mentioned where the senior person in the center of all this is no longer getting their hands dirty enough to collect the input needed to drive their high-level intuition and do their high-value senior things.
Hmm, I suppose it’s possible that AI R&D has a different flavor than what I’m used to. The software projects I’ve spent my career on are usually not very experimental in nature; the goal is generally not to learn whether an idea shows promise, it’s to design and write code to implement a feature spec, for integration into the production system. If a junior dev does a so-so job, I have to work with them to bring it up to a higher standard, because we don’t want to incur the tech debt of integrating so-so code; we’d be paying for it for years. Maybe that plays out differently in AI R&D?
Incidentally, in this scenario, do you actually get to 10x the productivity of all your staff? Or do you just get to fire your junior staff? Seems like that depends on the distribution of staff levels today and on whether, in this world, junior staff can step up and productively manage AIs themselves.
These are fascinating questions but beyond what I think I can usefully contribute to in the format of a discussion thread. I might reach out at some point to see whether you’re open to discussing further. Ultimately I’m interested in developing a somewhat detailed model, with well-identified variables / assumptions that can be tested against reality.
I’ve had a pretty similar experience personally but:
I think serial speed matters a lot and you’d be willing to go through a bunch more hassle if the junior devs worked 24⁄7 and at 10x speed.
Quantity can be a quality of its own—if you have truly vast (parallel) quantities of labor, you can be much more demanding and picky. (And make junior devs do much more work to understand what is going on.)
I do think the experimentation thing is probably somewhat big, but I’m uncertain.
(This one is breaking with the junior dev analogy, but whatever.) In the AI case, you can train/instruct once and then fork many times. In the analogy, this would be like you spending 1 month training the junior dev (who still works 24⁄7 and at 10x speed, so 10 months for them) and then forking them into many instances. Of course, perhaps AI sample efficiency is lower. However, my personal guess is that lots of compute spent on learning and aggressive schlep (e.g. proliferation, lots of self-supervised learning, etc) can plausibly substantially reduce or possibly eliminate the gap (at least once AIs are more capable) similar to how it works for EfficientZero.