Okay I got trapped in a Walgreens and read more of this, found something compelling. Emphasis mine:
The best systems today fall short at working out complex problems over longer time horizons, which require some mix of creativity, trial-and-error, and autonomy. But there are signs of rapid improvement: the maximum duration of ML-related tasks that frontier models can generally complete has been doubling roughly every seven months. Naively extrapolating this trend suggests that, within three to six years, AI models will become capable of automating many cognitive tasks which take human experts up to a month.
This is presented without much fanfare but feels like a crux to me. After all, the whole paper is predicated on the idea that AI will be able to effectively replace the work of human researchers. The paragraph has a footnote (44), which reads:
METR, ‘Quantifying the Exponential Growth in AI Ability to Complete Longer Tasks’ (forthcoming). See also Pimpale et al., ‘Forecasting Frontier Language Model Agent Capabilities’.
So the citation is an unreleased paper! That unreleased paper may make a splash, since (assuming this 7-month-doubling trend is not merely 1-2 years old) it strongly implies we really will find good solutions for turning LLMs agentic fairly soon.
(The second paper cited, only a couple weeks old itself, was mentioned presumably for its forecast of RE-Bench performance, key conclusion: “Our forecast suggests that agent performance on RE-Bench may reach a score of 1—equivalent to the expert baseline reported by Wijk et al. (2024)—around December 2026. We have much more uncertainty about this forecast, and our 95% CI reflects this. It has a span of over 8 years, from August 2025 to May 2033.” But it’s based on just a few data points spanning a period of about one year, so it’s not super convincing.)
The 7-month doubling trend we measured actually goes back to GPT-2 in 2019. Since 2024, the trend has been faster, doubling roughly every 3-4 months depending on how you measure, but we only have six 2024-25 models, so the error bars are wide and it’s really unclear which trend will be predictive of the future.
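For concreteness, here is a minimal sketch of what the naive extrapolation looks like under each doubling time. The starting horizon (~1 hour) and the 167-hour working month are illustrative assumptions, not figures from this thread:

```python
import math

# Naive extrapolation: if the task horizon doubles every `doubling_months`,
# how many years until models can handle tasks that take a human expert
# `target_hours` of work?
def years_until(target_hours, current_hours, doubling_months):
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_months / 12

CURRENT_HORIZON_HOURS = 1.0  # assumption: current models manage ~1-hour tasks
MONTH_OF_WORK_HOURS = 167.0  # assumption: one expert-month ~ 167 working hours

for label, months in [("7-month doubling (2019+ trend)", 7),
                      ("4-month doubling (2024+ trend)", 4),
                      ("3-month doubling (2024+ trend)", 3)]:
    yrs = years_until(MONTH_OF_WORK_HOURS, CURRENT_HORIZON_HOURS, months)
    print(f"{label}: ~{yrs:.1f} years")
```

Under the long-run 7-month doubling this gives roughly 4.3 years to month-long tasks, consistent with the quoted “three to six years”; the faster 2024-25 trend compresses that to roughly 2 years, which is why it matters which trend turns out to be predictive.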
FYI: the paper is now out.
See also the LW linkpost: METR: Measuring AI Ability to Complete Long Tasks, and a summary on Twitter.
(IMO this is a really cool paper — very grateful to @Thomas Kwa et al. I’m looking forward to digging into the details.)
I’ve been confused about what people mean when they say “trend lines indicate AGI by 2027”; seems like it’s basically this?
More than just this. OP actually documents it pretty well; see here.