I still think full automation of remote work in 10 years is plausible, because it’s what we would predict if we straightforwardly extrapolate current rates of revenue growth and assume no slowdown. However, I would only give this outcome around 30% chance.
In an important sense I feel like Ege and I are not actually far off here. I’m at more like 65-70% on this. I think this overall recommends quite similar actions. Perhaps we have a more important disagreement regarding something like P(AGI within 3 years), for which I’m at approx. 25-30% and Ege might be very low (my probability mass is somewhat concentrated in the next 3 years due to an expectation that compute and algorithmic effort scaling will slow down around 2029 if AGI or close isn’t achieved).
My guess is that this disagreement is less important to make progress on than disagreements regarding takeoff speeds/dynamics and alignment difficulty.
Ah right, my bad, I was confused. This is right, except that these estimates aren’t software-only; they include recent levels of compute scaling.
Those estimates do start at RE-Bench, but these are all estimates for how long things would take given the “default” pace of progress, rather than the actual calendar time required. Adding them together ends up with a result that doesn’t take into account speedup from AI R&D automation or the slowdown in compute and algorithmic labor growth after 2028.
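To spell out the arithmetic issue with simply summing them: if a gap would take t years at the default pace but is crossed while an AI R&D progress multiplier of s applies, it takes roughly t/s calendar years, so the calendar total is the sum of the t/s terms, not the sum of the t terms. A toy sketch of this (all numbers made up, not taken from the model):

```python
# Toy illustration (all numbers made up): summing "default pace" gap estimates
# vs. dividing each by the AI R&D progress multiplier active while that gap is
# crossed. A multiplier below 1 would represent the post-2028 slowdown instead.
default_pace_years = [1.5, 1.0, 0.8]      # hypothetical gap estimates at today's pace
progress_multiplier = [1.1, 2.0, 5.0]     # hypothetical speedups while crossing each gap

naive_total = sum(default_pace_years)
adjusted_total = sum(t / s for t, s in zip(default_pace_years, progress_multiplier))

print(f"naive sum of default-pace estimates: {naive_total:.2f} years")
print(f"adjusted for AI R&D automation:      {adjusted_total:.2f} years")
```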
I think that usually in AI safety lingo people use timelines to mean time to AGI and takeoff to mean something like the speed of progression after AGI.
Thanks for bringing this up, I hadn’t seen this paper.
Before deciding how much time to spend on this I’m trying to understand how much this matters, and am having trouble interpreting your Wolfram Alpha plot. Can you ELI12? I tried having Claude plot our lognormal doubling time distribution against an inverse Gaussian with equivalent mean and variance and it looks very similar, but of course Claude could be messing something up.
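For anyone who wants to check the comparison themselves, here is roughly the kind of sketch I had Claude produce; the lognormal parameters below are illustrative placeholders rather than the ones in our model:

```python
# Compare a lognormal doubling-time distribution against an inverse Gaussian
# with the same mean and variance. The mu/sigma values are illustrative
# placeholders, not the parameters actually used in the timelines model.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

mu, sigma = np.log(4.0), 0.5                     # hypothetical: median doubling time ~4 months
lognormal = stats.lognorm(s=sigma, scale=np.exp(mu))

m, v = lognormal.mean(), lognormal.var()
# scipy's invgauss parameterization: mean = mu_ig * scale, variance = mu_ig**3 * scale**2
mu_ig, scale = v / m**2, m**3 / v
inv_gauss = stats.invgauss(mu_ig, scale=scale)

x = np.linspace(0.01, 15, 500)
plt.plot(x, lognormal.pdf(x), label="lognormal")
plt.plot(x, inv_gauss.pdf(x), label="inverse Gaussian (matched mean/var)")
plt.xlabel("doubling time (months)")
plt.ylabel("density")
plt.legend()
plt.show()
```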
but “this variable doesn’t matter to outcomes” is not a valid critique w.r.t. things like “what are current capabilities/time horizon”
Where did I say it isn’t a valid critique? I’ve said both over text and verbally that the behavior in cases where superexponentiality is true isn’t ideal (which makes a bigger difference in the time horizon extension model than benchmarks and gaps).
Perhaps you are saying I said it’s invalid because I also said that it can be compensated for somewhat by lowering the p_superexponential at lower time horizons? Saying this doesn’t imply that I think the critique is completely invalid; I still think there is a real issue there. We probably disagree about the magnitude, but again that doesn’t mean I think it’s invalid.
At the very least this concedes that the estimates are not based on trend-extrapolation and are conjecture.
Yes, as I told you verbally, I will edit the relevant expandable to make this more clear. I agree that the way it’s presented currently is poor.
Here are two charts demonstrating that small changes in estimates of current R&D contribution and changes in R&D speedup change the model massively in the absence of a singularity.
These are great; this parameter is indeed at least a bit more important than I expected. I will make this more clear in the writeup, and will think a little more about what the median value should be (it’s very relevant for looking into the original bet offer anyway :)).
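For intuition about why this parameter matters so much absent a singularity, here is a deliberately oversimplified stand-in for the calculation (not the actual AI 2027 model): if a fraction f of current progress comes from software R&D and automation multiplies that component by m while the compute-driven remainder is unchanged, the overall rate is (1 − f) + f·m, which is quite sensitive to f once m gets large.

```python
# Oversimplified stand-in (not the actual AI 2027 model): overall progress rate
# when a fraction f of current progress is software-driven and AI R&D automation
# multiplies only that component by m.
def overall_rate(f: float, m: float) -> float:
    return (1 - f) + f * m  # compute-driven part unchanged, software part multiplied

for f in (0.4, 0.5, 0.6):           # hypothetical software shares of current progress
    for m in (2, 10, 25):           # hypothetical AI R&D progress multipliers
        print(f"software share {f:.0%}, multiplier {m:>2}x -> overall {overall_rate(f, m):.1f}x")
```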
First, my argument is not: we had limited time to do this, therefore it’s fine for us to not include whatever factors we want.
My argument is: we had limited time despite putting lots of work into this, because it’s a very ambitiously scoped endeavor. Adding uncertainty to the percent of progress that is software wouldn’t have changed the qualitative takeaways, therefore it’s not ideal but okay for us to present the model without that uncertainty (shifting the median estimate to a lower number might have changed them; I’ll separately reply to your comment on that. We should clearly distinguish these: your 0th percentile assertions are aimed more at the lack of uncertainty in the model than at the median estimate).
That being said, I agree with you that it would be nice, and I will likely add uncertainty to our model because it seems like good ROI. I appreciate you pushing me to do this.
The basic arguments are that (a) becoming fully superhuman at something which involves long-horizon agency across a diverse range of situations seems like it requires agency skills that will transfer pretty well to other domains (b) once AIs have superhuman data efficiency, they can pick up whatever domain knowledge they need for new tasks very quickly.
I agree we didn’t justify it thoroughly in our supplement; the reason it’s not justified more is simply that we didn’t get around to it.
As a prerequisite, it will be necessary to enumerate the set of activities that are necessary for “AI R&D”
As I think you’re aware, Epoch took a decent stab at this IMO here. I also spent a bunch of time thinking about all the sub-tasks involved in AI R&D early on in the scenario development. Tbh, I don’t feel like it was a great use of time compared to thinking at a higher level, but perhaps I was doing it poorly or am underestimating its usefulness.
What is the profile of acceleration across all tasks relating to AI R&D? What percentage of tasks are getting accelerated by 1.1x, 1.5x, 2x?
A late 2024 survey of frontier AI researchers (n=4) estimated a median AI R&D progress multiplier of 1.15x relative to no post-2022 AIs. I’d like to see bigger surveys here, but FWIW my best guess is that we’re already at a ~1.1x progress multiplier.
Readers are likely familiar with Hofstadter’s Law:
It always takes longer than you expect, even when you take into account Hofstadter’s Law.
It’s a good law. There’s a reason it exists in many forms (see also the Programmer’s Credo[9], the 90-90 rule, Murphy’s Law, etc.) It is difficult to anticipate all of the complexity and potential difficulties of a project in advance, and on average this contributes to things taking longer than expected. Constructing ASI will be an extremely complex project, and the AI 2027 attempt to break it down into a fairly simple set of milestones and estimate the difficulty of each milestone seems like fertile territory for Hofstadter’s Law.
One reason I don’t put much weight on this for timelines forecasts is that, to the extent I might have done so before, it would have made me more wrong from the perspective of my current views. For example, my AGI timelines median 3 years ago was 2060ish, and since then I’ve updated toward an AGI median of more like 2031 for reasons including underpredicting benchmark scores, underpredicting real-world impacts, and the model we built for AI 2027.
(wow, I didn’t remember that my median 3 years ago was 2060ish, wild)
While current AI models and tools are demonstrating substantial value in the real world, there is nevertheless a notorious gap between benchmark scores (“Ph.D level” and beyond) and real-world applicability. It strikes me as highly plausible that this reflects one or more as-yet-poorly-characterized chasms that may be difficult to cross.
You probably know this, but for onlookers: the magnitude of these chasms is discussed in our timelines forecast, method 2.
The authors address this objection, but the counterargument strikes me as flawed. Here is the key paragraph:
To see why this is conceptually mistaken, consider a theoretical AI with very superhuman experiment selection capabilities but sub-human experiment implementation skills. Even if automation didn’t speed up implementation of AI experiments at all and implementation started as 50% of researchers’ time, if automation led to much better experiments being chosen, a >2x AI R&D progress multiplier could be achieved.
In essence, this is saying that if the pace of progress is the product of two factors (experiment implementation time, and quality of experiment choice), then AI only needs to accelerate one factor in order to achieve an overall speedup. However, AI R&D involves a large number of heterogeneous activities, and overall progress is not simply the product of progress in each activity. Not all bottlenecks will be easily compensated for or worked around.
Also remember that we are talking about very large speedups here. In practice, Amdahl’s Law often starts to bite when optimizing a system by factors as small as 2x. Projecting speedups reaching 2000x is “pricing to perfection”; if the ability to route around difficult-to-automate activities is anything short of perfect, progress will fall short of the anticipated speedup curves.
Looking at the quote above, I’ll note that “choosing better experiments” is a relatively advanced skill, which will likely not emerge until well after experiment implementation skills. More generally, the high-level decision-making skills needed to maintain accelerating progress in the face of powerful-but-uneven AI capabilities seem like they would not emerge until late in the game.
You’re bringing up a more sophisticated objection than the one I was addressing, which didn’t acknowledge things like multiplier effects or being able to shift the task distribution.
Regarding research taste / experiment selection coming well after experiment implementation, I disagree in my median case depending on what you mean by “well after,” due to intuitions I describe in my other comment. Also I’d note that there are some early signs of research taste or management being fairly automatable (take these with pinches of salt, ofc there are limitations!).
Also, we think that research taste is a multiplier that isn’t tied to many distinct sub-activities; I’m curious whether you could provide examples of several sub-activities whose capabilities aren’t super correlated.
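To make the shape of that disagreement concrete, here is a toy contrast (purely illustrative numbers) between an Amdahl-style serial-fraction view, where the unautomated share caps the overall speedup, and the multiplicative research-taste view, where better experiment selection raises throughput even with no implementation speedup:

```python
# Toy contrast (purely illustrative numbers): serial-fraction (Amdahl) view
# vs. a multiplicative "research taste" view of the AI R&D progress multiplier.
def amdahl_speedup(automatable_fraction, speedup_on_automatable):
    # Overall speedup when only a fraction of the work can be accelerated at all.
    return 1 / ((1 - automatable_fraction) + automatable_fraction / speedup_on_automatable)

def taste_multiplier(implementation_speedup, experiment_quality_multiplier):
    # Overall progress if better experiment selection multiplies the value of each
    # experiment, independent of how fast experiments are implemented.
    return implementation_speedup * experiment_quality_multiplier

print(amdahl_speedup(0.5, 1000))   # ~2.0: capped by the unautomated half
print(taste_multiplier(1.0, 3.0))  # 3.0: >2x even with no implementation speedup
```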
The model assumes very high speedup factors (25x to 250x), implying very broad and robust capabilities, quite far in advance of ASI.
FWIW, these are heavily informed by surveys of frontier AI researchers, including a more recent higher-sample-size survey that we haven’t made public yet but which gave similar results to our previous smaller ones (though the recent one was done more quickly, so it may have more definitional-confusion issues).
Inevitably, some of these activities will be harder to automate than others, delaying the overall timeline. It seems difficult to route around this problem. For instance, if it turns out to be difficult to evaluate the quality of model outputs for fuzzy / subjective tasks, it’s not clear how an R&D organization (regardless of how much or little automation it has incorporated) could rapidly improve model capabilities on those tasks, regardless of how much progress is being made in other areas.
One reason I expect less jagged progress than you do is my intuition that even for tasks that are theoretically easy to verify/check, if they take a long time for humans and are very valuable, they will still often be hard to automate if there aren’t easily verifiable intermediate outputs. For example, perhaps it’s much easier to automate few-hour coding tasks than few-hour tasks in less verifiable domains. But for coding tasks that take humans months, it’s not clear that there’s a much better training signal for intermediate outputs than there is for tasks with a less verifiable end state. And if there aren’t easily verifiable intermediate outputs, it seems you face similar challenges to short-horizon non-verifiable tasks in terms of getting a good training signal. Furthermore, the sorts of long-horizon coding tasks humans do are often inherently vague and fuzzy as well, more so than shorter ones. It’s less clear how much of an issue this is for math, but for coding this consideration points me toward expecting automation of coding not that much before other fuzzier skills.
As a minor point of feedback, I’d suggest adding a bit of material near the top of the timelines and/or takeoff forecasts, clarifying the range of activities meant to be included in “superhuman coder” and “superhuman AI researcher”, e.g. listing some activities that are and are not in scope. I was startled to see Ryan say “my sense is that an SAR has to be better than humans at basically everything except vision”; I would never have guessed that was the intended interpretation.
This is fair. To the extent we have chosen what activities to include, it’s supposed to encompass everything that any researcher/engineer currently does to improve AIs’ AI R&D capabilities within AGI companies; see the AI R&D progress multiplier definition: “How much faster would AI R&D capabilities...”. As to whether we should include activities that researchers or engineers don’t do, my instinct is mostly no, because the main thing I can think of there is data collection, and that feels like it should be treated separately (in the AI R&D progress multiplier appendix, we clarify that using new models for synthetic data generation isn’t included in the AI R&D progress multiplier, as we want to focus on improved research skills, though I’m unsure if that’s the right choice and am open to changing it).
But I did not put a lot of effort into thinking about how exactly to define the range of applicable activities and what domains should be included; my intuition is that it matters less than you think, because I expect automation to be less jagged than you do (I might write more about that in a separate comment) and because of intuitions that research taste is the key skill and is relatively domain-general, though I agree expertise helps. I agree that there will be varying multipliers depending on the domain, but given that the takeoff forecast is focused mostly on a set of AI R&D-specific milestones, I think it makes sense to focus on that.
Ok yeah, seems like this is just a wording issue and we’re on the same page.
SAR has to dominate all human researchers, which must include whatever task would otherwise bottleneck.
This, and the same description for the other milestones, aren’t completely right; it’s possible that there are some activities on which the SAR is worse. But it can’t be many activities and it can’t be much worse at them, given that the SAR needs to overall be doing the job of the best human researcher 30x faster.
I simply find it impossible to accept this concatenation of intuitive leaps as sufficient evidence to update very far.
Seems like this should depend on how you form your current views on timelines/takeoff. The reason I put a bunch of stock in our forecasts for informing my personal views is that, while they are very flawed, they seem better than any previous piece of evidence or intuition I was relying on. But probably we just disagree on how to weigh different forms of evidence.
I basically agree. The reason I expect AI R&D automation to happen before the rest of remote work isn’t because I think it’s fundamentally much easier, but because (a) companies will try to automate it before other remote work tasks, and relatedly (b) because companies have access to more data and expertise for AI R&D than other fields.