I’m the chief scientist at Redwood Research.
ryan_greenblatt
For context, my preregistered guess would be that AI R&D speed ups along the way to superhuman coder make it come around 1.5x faster, though anything between 1.25x and 2x is consistent with my best guess. (So, e.g., rather than the ~2029.0 median on Eli’s model, without intermediate AI R&D speed ups we’d see a median around 2031.0 or so. I’d expect a bigger effect on the 10th percentile due to uncertainty.)
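To spell out the arithmetic (a minimal sketch; the ~2025.0 starting point is my own assumption for illustration, not a parameter from Eli’s model):

```python
# Back-of-the-envelope version of the claim above. The 2025.0 anchor is an
# assumption for illustration, not a parameter taken from Eli's model.
start = 2025.0
median_with_speedups = 2029.0  # approximate median with intermediate AI R&D speed ups
speedup_factor = 1.5           # my preregistered guess for how much those speed ups help

# If the speed ups make progress ~1.5x faster, removing them stretches the
# remaining calendar time by the same factor.
median_without_speedups = start + (median_with_speedups - start) * speedup_factor
print(median_without_speedups)  # -> 2031.0
```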
Sure. Epoch estimates 2e25 FLOP for GPT-4 and 3.4e24 for DeepSeek-V3. So a bit less than 10x actually (more like 6x), but in the same ballpark. (And V3 is substantially better.) R1 is around 1/6 of the DeepSeek-V3 cost.
I said there was no compute acceleration, not that there was no more compute scaling?
(Yes, sorry, edited my original comment to clarify.)
an equal and super-accelerated algorithmic or efficiency term (v_algorithmic) as shown in the code here
I don’t think the “AI assisted AI R&D” speed ups along the way to superhuman coder make a huge difference to the bottom line?
In all timelines models presented there is acknowledgement that compute does not accelerate.
When you say “accelerate” do you mean “the rate of compute scaling increases”? I agree they aren’t expecting this (and roughly model a fixed rate of compute progress which matches historical trends as this is just an extrapolation). However, note that the superexponentiality in the timelines model isn’t based on this, and is instead based on views about the mapping from effective compute to horizon length.
If you just mean “the timelines model assumes no compute scaling”, then I think this is a clear no?
The basic timelines model is based on extrapolating out the current trend of AI progress.
Where do you see mention of no further compute scaling in this timelines model (https://ai-2027.com/research/timelines-forecast)?
I agree that the takeoff model focuses on the regime without compute scaling as the takeoff occurs over a year which doesn’t allow that much compute scaling (though I believe the final takeoff numbers / scenario are accounting for compute scaling).
zero-shot WikiText103 perplexity and 5-shot MMLU
These are somewhat awkward benchmarks because they don’t actually measure downstream usefulness at software engineering or AI research. In particular, these tasks might not measure improvements in RL which can have huge effects on usefulness and have seen fast algorithmic progress.
Can we instead use SWE-bench or METR’s task suite?
For instance, here is a proposed bet:
GPT-4 was released in March 2023 (2 years ago). So, we’d expect a model which used 10x less FLOP to perform similarly well (or better) on agentic tasks (like SWE-bench or METR’s task suite).
Oh wait, there already is such a model! DeepSeek-V3 / R1 is IMO clearly better than GPT-4 on these tasks (and other tasks) while using less than 1/10th of GPT-4’s FLOP and being released within 2 years. So bet resolved?
Edit: more like 6x less FLOP actually, so this is a bit messy and would need to lean on better performance. People don’t seem to bother training compute-optimal models with ~10x less FLOP than GPT-4 these days...
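(For concreteness, the ratio implied by the Epoch estimates quoted earlier:)

```python
# Compute ratio implied by Epoch's training-compute estimates quoted above.
gpt4_flop = 2e25           # Epoch estimate for GPT-4
deepseek_v3_flop = 3.4e24  # Epoch estimate for DeepSeek-V3

print(gpt4_flop / deepseek_v3_flop)  # ~5.9, i.e. "more like 6x less FLOP" rather than 10x
```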
Actually, I think Deepseek-V3 also does better than GPT-4 on MMLU, though we can’t compare perplexity. So, seems ~resolved either way, at least for progress in the last 2 years and if you’re fine with assuming that DeepSeek-V3 isn’t rigged or using distillation.
The delta is much more extreme if instead of looking at software engineering you look at competitive programming or math.
This is both projected forward and treated with either 1 (in 45% of cases) or 2 (in all cases) super-exponential terms that make it significantly faster than an inferred 4.6x per year.
Hmm, I think you’re looking at the more basic trend extrapolation for the timelines model and assuming that the authors are thinking that this trend extrapolation is purely due to algorithms?
(The authors do model this trend accelerating due to AIs accelerating algorithms, so if the rate of algorithmic progress was much lower, that would make a big difference to the bottom line.)
I do agree that “how fast is algorithmic progress right now” might be a crux and presumably the authors would think differently if they thought algorithmic progress was much faster.
This seems relatively clearly false in the case of competition programming problems. Concretely, o3 with 50 submissions beats o1 with 10k submissions. (And o1 is presumably much better than the underlying instruct model.)
I’d guess this paper doesn’t have the actual optimal methods.
Another way to put this disagreement is that you can interpret all of the AI 2027 capability milestones as referring to the level of the weakest bottlenecking capability, so:
Superhuman coder has to dominate all research engineers at all pure research engineering tasks. This includes the most bottlenecking capability.
SAR has to dominate all human researchers, which must include whatever task would otherwise bottleneck.
SIAR (superintelligent AI researcher) has to be so good at AI research (the gap between SAR and SIAR is 2x the gap between an automated median AGI company researcher and a SAR) that it has this huge 2x-gap advantage over the SAR despite the potentially bottlenecking capabilities.
So, I think perhaps what is going on is that you mostly disagree with the human-only, software-only times and are plausibly mostly on board otherwise.
I’m having trouble parsing this sentence
You said “This is valid for activities which benefit from speed and scale. But when output quality is paramount, speed and scale may not always provide much help?”. But for activities that aren’t bottlenecked on the environment, to achieve a 10x acceleration you just need 10x more speed at the same level of capability. For quality to be a crux for a relative speed up, there needs to be some environmental constraint (like only being able to run 1 experiment).
Is that a fair statement?
Yep, my sense is that an SAR has to[1] be better than humans at basically everything except vision.
(Given this, I currently expect that SAR comes at basically the same time as “superhuman blind remote worker”, at least when putting aside niche expertise which you can’t learn without a bunch of interaction with humans or the environment. I don’t currently have a strong view on the difficulty of matching human visual abilities, particularly at video processing, but I wouldn’t be super surprised if video processing is harder than basically everything else ultimately.)
If “producing better models” (AI R&D) requires more than just narrow “AI research” skills, then either SAR and SAIR need to be defined to cover that broader skill set (in which case, yes, I’d argue that 1.5-10 years is unreasonably short for unaccelerated SC->SAR),
It is defined to cover the broader set? It says “An AI system that can do the job of the best human AI researcher”. (Presumably this is implicitly “any of the best AI researchers”, who need to learn misc skills as part of their jobs, etc.) Notably, superintelligent AI researcher (SIAR) happens after “superhuman remote worker”, which requires being able to automate any work a remote worker could do.
I’m guessing your crux is that the time is too short?
[1] “Has to” is maybe a bit strong; I probably should have said “will probably end up needing to be at least competitive with the best human experts at basically everything (other than vision), and better at more central AI R&D, given the realistic capability profile”. I think I generally expect full automation to hit everywhere at around the same time, putting aside vision and physical tasks.
I’m worried that you’re missing something important because you mostly argue against large AI R&D multipliers, but you don’t spend much time directly referencing compute bottlenecks in your arguments that the forecast is too aggressive.
Consider the case of doing pure math research (which we’ll assume for simplicity doesn’t benefit from compute at all). If we made emulated versions of the 1000 best math researchers and then made 1 billion copies of each of them, all running at 1000x speed, I expect we’d get >1000x faster progress. As far as I can tell, the words in your arguments don’t apply any less to this situation than to the AI R&D situation.
Going through the object-level response to each of these arguments, for pure math research and for the corresponding AI R&D case:
Simplified Model of AI R&D
Math: Yes, there are many tasks in math R&D, but the 1000 best math researchers could already do them or learn to do them.
AI R&D: By the time you have SAR (superhuman AI researcher), we’re assuming the AIs are better than the best human researchers(!), so heterogeneous tasks don’t matter if you accept the premise of SAR: whatever the humans could have done, the AIs can do better. It does apply to the speed ups at superhuman coders, but I’m not sure this will make a huge difference to the bottom line (and you seem to mostly be referencing later speed ups).
Amdahl’s Law
Math: The speed up is near universal because we can do whatever the humans could do.
AI R&D: Again, the SAR is strictly better than humans, so hard-to-automate activities aren’t a problem. When we’re talking about ~1000x speed up, the authors are imagining AIs which are much smarter than humans at everything and which are running 100x faster than humans at immense scale. So, “hard to automate tasks” is also not relevant.
All this said, compute bottlenecks could be very important here! But the bottlenecking argument must directly reference these compute bottlenecks, and there has to be no way to route around them. My sense is that much better research taste and perfect implementation could make experiments with some fixed amount of compute >100x more useful. To me, this feels like the important question: how much can labor route around compute bottlenecks and utilize compute much more effectively? The naive extrapolation out of the human range makes this look quite aggressive: the median AI company employee is probably 10x worse at using compute than the best, so an AI which is as superhuman as 2x the gap between median and best would naively be 100x better at using compute than the best employee. (Is the research taste ceiling plausibly this high? I currently think extrapolating out another 100x is reasonable given that we don’t see things slowing down within the human range as far as we can tell.)
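A minimal sketch of that naive extrapolation; the 10x median-to-best gap is the rough guess stated above, and treating the gap multiplicatively is my simplifying assumption:

```python
# Naive extrapolation of compute-use effectiveness beyond the best human,
# treating the median-to-best gap multiplicatively (a simplifying assumption).
median_to_best_gap = 10.0  # best employee uses compute ~10x better than the median (rough guess)
gaps_beyond_best = 2.0     # "as superhuman as 2x the gap between median and best"

multiplier_over_best = median_to_best_gap ** gaps_beyond_best
print(multiplier_over_best)  # -> 100.0, i.e. ~100x better at using compute than the best employee
```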
Dependence on Narrow Data Sets
This is only applicable to the timeline to the superhuman coder milestone, not to takeoff speeds once we have a superhuman coder. (Or maybe you think a similar argument applies to the time between superhuman coder and SAR.)
Hofstadter’s Law As Prior
Math: We’re talking about speed up relative to what the human researchers would have done by default, so this just divides both sides equally and cancels out.
AI R&D: This should also just divide both sides. That said, Hofstadter’s Law does apply to the human-only, software-only times between milestones. But note that these times are actually quite long! (Maybe you think they are still too short, in which case fair enough.)
Sure, but for output quality beyond what humans could (ever) produce to matter for the relative speed up, you have to argue about compute bottlenecks, not Amdahl’s law for the automation itself! (As in, if some humans would have done something in 10 years and it doesn’t have any environmental bottleneck, then 10x faster emulated humans can do it in 1 year.)
My mental model is that, for some time to come, there will be activities where AIs simply aren’t very competent at all,
Notably, SAR is defined as “Superhuman AI researcher (SAR): An AI system that can do the job of the best human AI researcher but faster, and cheaply enough to run lots of copies.” So, it is strictly better than the best human researcher(s)! So, your statement might be true, but is irrelevant if we’re conditioning on SAR.
It sounds like your actual objection is in the human-only, software-only time from superhuman coder to SAR (you think this would take more than 1.5-10 years).
Or perhaps your objection is that you think there will be a smaller AI R&D multiplier for superhuman coders. (But this isn’t relevant once you hit full automation!)
I note that I am confused by this diagram. In particular, the legend indicates a 90th percentile forecast of “>2100” for ASI, but the diagram appears to show the probability dropping to zero around the beginning of 2032.
I think it’s just that the tail is very long and flat with <1% per year. So, it looks like it goes to zero, but it stays just above.
Hmm, I think your argument is roughly right, but missing a key detail. In particular, the key aspect of the SARs (and higher levels of capability) is that they can be strictly better than humans at everything while simultaneously being 30x faster and 30x more numerous. (Or, there is 900x more parallel labor, but we can choose to run this as 30x more parallel instances each running 30x faster.)
So, even if these SARs are only slightly better than humans at these 10 activities and these activities don’t benefit from parallelization at all, they can still do them 30x faster!
So, progress can actually be accelerated by up to 3000x even if the AIs are only as good as humans at these 10 activities and can’t productively dump in more labor.
In practice, I expect that you can often pour more labor into whatever bottlenecks you might have. (And compensate etc as you noted.)
By the time the AIs have a 1000x AI R&D multiplier, they are running at 100x human speed! So, I don’t think the argument for “you won’t get 1000x uplift” can come down to an Amdahl’s law argument about the automation itself. It will have to depend on compute bottlenecks.
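To make the Amdahl’s law point concrete, here is a toy calculation; the time shares and per-activity speedups below are made up purely for illustration:

```python
# Amdahl's-law-style toy model: overall speedup given per-activity time shares
# and speedups. All numbers are made up for illustration.
def overall_speedup(shares_and_speedups):
    """shares_and_speedups: list of (fraction_of_baseline_time, speedup) pairs."""
    return 1.0 / sum(share / speedup for share, speedup in shares_and_speedups)

serial_speed = 30.0  # even non-parallelizable bottleneck activities run 30x faster

# Bottleneck activities (20% of baseline time) only get the 30x serial speedup;
# everything else is accelerated much more (say 1000x).
print(overall_speedup([(0.2, serial_speed), (0.8, 1000.0)]))  # ~134x

# Even if half of all baseline time is bottlenecked, the overall multiplier
# stays above the 30x serial speedup, since every activity gets at least 30x.
print(overall_speedup([(0.5, serial_speed), (0.5, 1000.0)]))  # ~58x
```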
(My sense is that the progress multipliers in AI 2027 are too high but also that the human-only times between milestones are somewhat too long. On net, this makes me expect somewhat slower takeoff with a substantial chance on much slower takeoff.)
(Yes, I’m aware you meant imprecise probabilities. These aren’t probabilities though (in the same sense that a range of numbers isn’t a number); e.g., you’re unwilling to state a median.)
It’s notable that you’re just generally arguing against having probabilistic beliefs about events which are unprecedented[1]; nothing here is specific to AI forecasting. You’re mostly objecting to the idea of having (e.g.) medians on events like this.
[1] Of course, the level of precedentedness is continuous, and from my understanding, forecasters have done OK at predicting increasingly unprecedented events. Maybe your take is that AI is the most unprecedented event anyone has ever tried to predict. This seems maybe plausible.
I wouldn’t be surprised if 3-5% of questions were mislabeled or impossible to answer, but 25-50%? You’re basically saying that HLE is worthless. I’m curious why.
Various people looked at randomly selected questions and found similar numbers.
(I don’t think the dataset is worthless, I think if you filtered down to the best 25-50% of questions it would be a reasonable dataset with acceptable error rate.)
Isn’t it kinda unreasonable to put 10% on superhuman coder in a year if current AIs have a 15 nanosecond time horizon? TBC, it seems fine IMO if the model just isn’t very good at predicting the 10th/90th percentile, especially with extreme hyperparameters.
I also don’t know how they ran this; I tried looking for the model code and couldn’t find it. (Edit: found the code.)
Looks like Eli beat me to the punch!
This contradicts METR timelines, which, IMO, is the best piece of info we currently have for predicting when AGI will arrive.
Have you read the timelines supplement? One of their main methodologies involves using this exact data from METR (yielding 2027 medians). The key differences from the extrapolation methodology used by METR are: they use a somewhat shorter doubling time, which seems closer to what we see in 2024 (a 4.5-month median rather than 7 months), and they put substantial probability on the trend being superexponential.
why the timelines will be much longer
I think the timelines to superhuman coder implied by METR’s work are closer to 2029 than 2027, so 2 more years or 2x longer. I don’t think most people will think of this as much longer, though I guess 2x longer could qualify as much longer.
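To illustrate how much the doubling-time assumption alone matters, here is a rough extrapolation in the style of METR’s trend; the ~1-hour current horizon and the ~1-month horizon needed for superhuman coder are placeholder assumptions on my part, not the exact values used by METR or AI 2027:

```python
import math

# Rough time-horizon extrapolation. The current ~1-hour horizon and the
# ~1-month target for superhuman coder are placeholder assumptions, not the
# exact parameters used by METR or the AI 2027 timelines model.
def years_to_reach(target_hours, current_hours, doubling_months):
    doublings_needed = math.log2(target_hours / current_hours)
    return doublings_needed * doubling_months / 12.0

current_horizon_hours = 1.0       # placeholder for today's 50%-reliability horizon
target_horizon_hours = 30 * 24.0  # placeholder "~1 month" horizon

print(years_to_reach(target_horizon_hours, current_horizon_hours, doubling_months=7.0))  # ~5.5 years
print(years_to_reach(target_horizon_hours, current_horizon_hours, doubling_months=4.5))  # ~3.6 years
# A superexponential trend (shrinking doubling time) would pull these in further.
```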
Considering that frontier LLMs of today can solve at most 20% of problems on Humanity’s Last Exam, both of these predictions appear overly optimistic to me. And HLE isn’t even about autonomous research, it’s about “closed-ended, verifiable questions”. Even if some LLM scored >90% on HLE by late 2025 (I bet this won’t happen), that wouldn’t automatically imply that it’s good at open-ended problems with no known answer. Present-day LLMs have so little agency that it’s not even worth talking about.
I’m not sure that smart humans can solve 20% on Humanity’s Last Exam (HLE). I also think that around 25-50% of the questions are impossible or mislabeled. So, this doesn’t seem like a very effective way to rule out capabilities.
I think scores on HLE are mostly just not a good indicator of the relevant capabilities. (Given our current understanding.)
TBC, my median to superhuman coder is more like 2031.
Just ran the code and it looks like I’m spot on and the median goes to Mar 2031.