I would find this post much more useful to engage with if you more concretely described the types of tasks that you think AIs will remain bad at and gave a bunch of examples. (Or at least made an argument for why it is hard to construct examples, if that is your perspective.)
I think you’re pointing to a category like “tasks that require lots of serial reasoning for humans, e.g., hard math problems, particularly ones where the output should be a proof”. But I find this confusing, because we’ve pretty clearly seen huge progress on this in the last year, such that the naive extrapolation would imply that systems are much better at this by the end of the year.
Already AIs seem to be not that much worse at tricky serial reasoning than smart humans:
My sense is that AIs are pretty competitive at 8th-grade competition math problems that have numerical answers and are relatively short. As in, they aren’t much worse than the best 8th graders at AIME or similar.
At proofs, the AIs are worse, but showing some signs of life.
On logic/reasoning puzzles, the AIs are already pretty good and seem to be getting better rapidly on any specific type of task, as far as I can tell.
It would be even better if you pointed to some particular benchmark and made predictions.
What are some of the most impressive things you do expect to see AI do, such that if you didn’t see them within 3 or 5 years, you’d majorly update about time to the type of AGI that might kill everyone?
Consider tasks that quite good software engineers (maybe top 40% at Jane Street) typically do in 8 hours without substantial prior context on that exact task. (As in, an 8-hour median completion time.) Now, we’ll aim to sample these tasks such that their distribution and characteristics are close to the distribution of work tasks in actual software engineering jobs (we probably can’t get that close because of the limited-context constraint, but we’ll try).
In short timelines, I expect AIs will be able to succeed at these tasks 70% of the time within 3-5 years, and if they don’t, I will update toward longer timelines. (This is potentially using huge amounts of inference compute and using strategies that substantially differ from how humans do these tasks.)
The quantitative update would depend on how far AIs are from being able to accomplish this. If AIs were quite far (e.g., at 2 hours on this metric, which is pretty close to where they are now) and the trend on horizon length indicated N years until 64 hours, I would update to something like 3N as my median for AGI.
(I think a reasonable interpretation of the current trend indicates something like 4-month doubling times. We’re currently at a bit less than 1 hour on this metric, I think, though maybe more like 30 minutes? Maybe you need to get to 64 hours before stuff feels pretty close to getting crazy. So, this suggests 2.3 years, though I expect longer in practice. My actual median for “AGI” in a strong sense is more like 7 years, so 3x longer than this.)
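To spell out the arithmetic in that parenthetical, here is a minimal sketch of the extrapolation (the 30-minute starting point, 64-hour target, and 4-month doubling time are the rough guesses above, not measured values):

```python
import math

# Rough inputs from the parenthetical above (guesses, not measurements):
current_horizon_hours = 0.5   # ~30 minutes on this metric
target_horizon_hours = 64.0   # the point at which things feel close to crazy
doubling_time_months = 4.0    # one reading of the current trend

doublings = math.log2(target_horizon_hours / current_horizon_hours)
years = doublings * doubling_time_months / 12
print(f"{doublings:.0f} doublings, ~{years:.1f} years")  # 7 doublings, ~2.3 years
```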
Edit: Note that I’m not responding to “most impressive”, just trying to operationalize something that would make me update.
Thanks… but wait, this is among the most impressive things you expect to see? (You know more than I do about that distribution of tasks, so you could justifiably find it more impressive than I do.)
No, sorry, I was mostly focused on “such that if you didn’t see them within 3 or 5 years, you’d majorly update about time to the type of AGI that might kill everyone”. I didn’t actually pick up on “most impressive” and instead tried to focus on something that occurs substantially before things get crazy.
Most impressive would probably be stuff like “automate all of AI R&D and greatly accelerate the pace of research at AI companies”. (This seems about 35% likely to me within 5 years, so I’d update by at least that much.) But this hardly seems that interesting? I think we can agree that once the AIs are automating whole companies, stuff is very near.
Ok. So I take it you’re very impressed with the difficulty of the research that is going on in AI R&D.
we can agree that once the AIs are automating whole companies stuff
(FWIW I don’t agree with that; I don’t know what companies are up to, some of them might not be doing much difficult stuff and/or the managers might not be able to or care to tell the difference.)
I mean, I don’t think AI R&D is a particularly hard field per se, but I do think it involves lots of tricky stuff and isn’t much easier to automate than some other plausibly-important-to-takeover field (e.g., robotics). (I could imagine that the AIs have a harder time automating philosophy even if they were trying to work on this, but it’s more confusing to reason about because human work on this is so dysfunctional.) The main reason I focused on AI R&D is that I think it is much more likely to be fully automated first, and it seems like it is probably fully automated prior to AI takeover.
Ok, I think I see what you’re saying. To check part of my understanding: when you say “AI R&D is fully automated”, I think you mean something like:
Most major AI companies have fired almost all of their SWEs. They still have staff to physically build datacenters, do business, etc.; and they have a few overseers / coordinators / strategizers of the fleet of AI R&D research gippities; but the overseers are acknowledged to basically not be doing much, and not clearly be even helping; and the overall output of the research group is “as good or better” than in 2025--measured… somehow.
I could imagine the capability occurring but not playing out that way, because the SWEs won’t necessarily be fired even after becoming useless—so it won’t be completely obvious from the outside. But this is a sociological point about when companies fire people, not a prediction about AI capabilities.
SWEs won’t necessarily be fired even after becoming useless
I’m actually surprised at how eager/willing big tech is to fire SWEs once they’re sure they won’t be economically valuable. I think a lot of priors for them being stable come from the ZIRP era. Now, these companies have quite frequent layoffs, silent layoffs, and performance firings. Companies becoming leaner will be a good litmus test for a lot of these claims.
So, I agree that there has been substantial progress in the past year, hence the post title. But I think if you naively extrapolate that rate of progress, you get around 15 years.
The problem with the three examples you’ve mentioned is again that they’re all comparing AI performance with human cognitive work done over a short amount of time. I think the relevant scale doesn’t go from 5th-grade performance through 8th-grade performance to university-level performance or whatever, but from “what a smart human can do in 5 minutes” through “what a smart human can do in an hour” to “what a smart human can do in a day”, and so on.
I don’t know if there is an existing benchmark that measures anything like this. (I agree that more concrete examples would improve the post, fwiw.)
And then a separate problem is that math problems are in the easiest category from §3.1 (as are essentially all benchmarks).
I think if you look at “horizon length”—at what task duration (in terms of human completion time) the AIs get the task right 50% of the time—the trends indicate doubling times of maybe 4 months (though 6 months is plausible). Let’s say 6 months to be more conservative. I think AIs are at like 30 minutes on math? And 1 hour on software engineering. It’s a bit unclear, but let’s go with that. Then, to get to 64 hours on math, we’d need 7 doublings, which at 6 months per doubling is 3.5 years. So, I think the naive trend extrapolation is much faster than you think? (And this estimate strikes me as conservative, at least for math IMO.)
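Making the same extrapolation explicit for these numbers (a sketch only; the 30-minute and 1-hour starting points and the 6-month doubling time are the rough guesses above):

```python
import math

def years_to_horizon(current_hours, target_hours=64.0, doubling_months=6.0):
    # Naive trend extrapolation: time until the 50%-success horizon length
    # grows from current_hours to target_hours at a fixed doubling time.
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_months / 12

print(years_to_horizon(0.5))  # math: ~30 min today -> 7 doublings -> 3.5 years
print(years_to_horizon(1.0))  # SWE: ~1 hour today -> 6 doublings -> 3.0 years
```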
FWIW, this seems like an overestimate to me. Maybe o3 is better than other things, but I definitely can’t get equivalents of 1-hour chunks out of language models, unless it happens to be an extremely boilerplate-heavy step. My guess is more like 15 minutes, and for debugging (which in my experience is close to most software-engineering time), more like 5-10 minutes.
The question of context might be important; see here. I wouldn’t find 15 minutes that surprising for a ~50% success rate, but I’ve seen numbers more like 1.5 hours. I thought this was likely to be an overestimate, so I went down to 1 hour, but more like 15-30 minutes is also plausible.
Keep in mind that I’m talking about agent scaffolds here.
Keep in mind that I’m talking about agent scaffolds here.
Yeah, I have failed to get any value out of agent scaffolds, and I don’t think I know anyone else who has so far. If anyone has gotten more value out of them than just the Cursor chat, I would love to see how they do it!
All things like Cursor composer and codebuff and other scaffolds have been worse than useless for me (though I haven’t tried them again after o3-mini, which maybe made a difference; it’s been on my to-do list to give it another try).
FYI, I do find aider, using mixed routing between r1 and o3-mini-high as the architect model with sonnet as the editor model, to be slightly better than cursor/windsurf etc.
Or for minimal setup, this is what is ranking the highest on the aider-polyglot test:
aider --architect --model openrouter/deepseek/deepseek-r1 --editor-model sonnet
(I don’t expect o3-mini is a much better agent than 3.5 sonnet new out of the box, but probably a hybrid scaffold with o3 + 3.5 sonnet will be substantially better than 3.5 sonnet. Just o3 might also be very good. Putting aside cost, I think o1 is usually better than o3-mini on open-ended programming agency tasks.)
I don’t think a doubling every 4 or 6 months is plausible. I don’t think a doubling on any fixed timescale is plausible, because I don’t think overall progress will be exponential. I think you could have exponential progress on thought generation, but this won’t yield exponential progress on performance. That’s what I was trying to get at with this paragraph:
My hot take is that the graphics I opened the post with were basically correct in modeling thought generation. Perhaps you could argue that progress wasn’t quite as fast as the most extreme versions predicted, but LLMs did go from subhuman to superhuman thought generation in a few years, so that’s pretty fast. But intelligence isn’t a singular capability; it’s a phenomenon better modeled as two capabilities, and increasing just one of them happens to have sub-linear returns on overall performance.
So far (as measured by the 7-card puzzle, which I think is a fair data point), I think we went from ‘no sequential reasoning whatsoever’ to ‘attempted sequential reasoning but basically failed’ (Jun 13 update) to now being able to do genuine sequential reasoning for the first time. And if you look at how DeepSeek does it, to me this looks like the kind of thing where I expect difficulty to grow exponentially with argument length. (Based on stuff like it constantly having to go back and double-check even when it got something right.)
What I’d expect from this is not a doubling every N months, but perhaps an ability to reliably do one more step every N months. I think this translates into above-constant returns on the “horizon length” scale—because I think humans need more than 2x time for 2x steps—but not exponential returns.
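As a toy illustration of that claim (a sketch with made-up parameters: assume the AI can reliably chain one more step every N months, and that a human needs superlinearly more time for longer chains, here time proportional to steps^1.5):

```python
# Toy model, not a prediction: all parameters are made up for illustration.
N = 4  # assumed months per additional step the AI can reliably chain

def ai_steps(months):
    # "one more step every N months"
    return 1 + months / N

def human_hours(steps, hours_per_step=0.1, exponent=1.5):
    # humans need more than 2x time for 2x steps (superlinear in chain length)
    return hours_per_step * steps ** exponent

for months in (0, 12, 24, 48):
    s = ai_steps(months)
    print(f"t={months:2d} mo: {s:4.1f} steps, horizon ~{human_hours(s):.1f} h")
# The implied horizon length keeps growing, but each successive doubling takes
# longer, i.e., above-constant returns on the horizon scale, not exponential.
```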
I expect difficulty to grow exponentially with argument length. (Based on stuff like it constantly having to go back and double-check even when it got something right.)
Training of DeepSeek-R1 doesn’t seem to do anything at all to incentivize shorter reasoning traces, so it’s just rechecking again and again, because why not. It’s like if you are taking an important 3-hour written test and you are done in 1 hour: it’s prudent to spend the remaining 2 hours obsessively verifying everything.
As I said in my top-level comment, I don’t see a reason to think that, once the issue is identified as the key barrier, work on addressing it would be so slow.