https://www.elilifland.com/. You can give me anonymous feedback here. I often change my mind and don’t necessarily endorse past writings.
SOTA models will take thousands of dollars per month to run at maximum intelligence and thoughtfulness, and most people won’t pay for that.
So do you disagree with the capability description of Agent-3-mini then?
We say: “It blows the other AIs out of the water. Agent-3-mini is less capable than Agent-3, but 10x cheaper, and still better than the typical OpenBrain employee.” And presumably at remote work jobs besides AI research it is more likely at the level of the median employee or lower, but still quite capable. So while of course maximum performance will require very high spend, we are projecting that by July of the scenario quite high capabilities are available somewhat cheaply.
“The Department of Defense considers this a critical advantage in cyberwarfare, and AI moves from #5 on the administration’s priority list to #2.” They’re talking about the DoD specifically, not the Trump administration more broadly. I actively disbelieve that AI will be the #2 priority for Trump in Feb ’27; it seems plausible for it to be the #2 priority in the DoD. I do believe there are a lot of NatSec people who are going to be extremely worried about cyber-warfare (and other threats) from AI.
This is supposed to be referring to the Trump administration more broadly. I’m curious why you’re so skeptical of it being #2 once the AIs are at the level of top human hackers.
Responses to some of your points:
There is no particular reason to endorse the particular set of gaps chosen. The most prominent gap that I’ve seen discussed, the ability of LLMs to come up with new ideas or paradigms, wasn’t included.
This skill doesn’t seem that necessary for superhuman coding, but separately I think that AIs can already do this to some extent and it’s unclear that it will lag much behind other skills.
“benchmarks-and-gaps” has historically proven to be an unreliable way of forecasting AI development. The problem is that human intuitions about what capabilities are required for specific tasks aren’t very good, and so more “gaps” are discovered once the original gaps have been passed.
I think with previous benchmarks it was generally clearer that solving them would be nowhere near what is needed for superhuman coding or AGI. But I agree that we should notice similar skulls with e.g. solving chess being considered AGI-complete.
“AI 2027” uses an implausible forecast of compute/algorithm improvement past 2028. It assumes that each continues exponential progress, but at half the rate (so 2.35x compute/year, and 1.5x algorithmic improvement/year).
Seems plausible; I implemented these as quick guesses, though this wouldn’t affect the mode or median forecasts much. I agree that we should have a long tail due to considerations like this, e.g. my 90th percentile is >2050.
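To make this concrete, here’s a minimal sketch (my own illustration, not part of the AI 2027 model) of how the post-2028 rates quoted above compound, treating “effective compute” as the product of physical compute and algorithmic efficiency:

```python
# Illustrative only: compound the quoted post-2028 growth assumptions.
# Effective compute is modeled (simplistically) as compute x algorithmic efficiency.

compute_growth_per_year = 2.35   # post-2028 compute growth (quoted above)
algo_growth_per_year = 1.5       # post-2028 algorithmic improvement (quoted above)

effective_growth = compute_growth_per_year * algo_growth_per_year  # ~3.5x per year

for years_after_2028 in (1, 3, 5, 10):
    total = effective_growth ** years_after_2028
    print(f"{years_after_2028:>2} years past 2028: ~{total:,.1f}x effective compute")
```

Even at these halved rates, effective compute still grows roughly 3.5x per year under this simple model, which is part of why the post-2028 assumptions matter more for the tail than for the mode or median.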
If current growth rates can deliver superhuman coding capabilities by 2027, we might actually see it happen. However, if those same capabilities would not arrive until 2028, then on some plausible financial models we wouldn’t see AGI until the mid-2030s or later.
I’m very skeptical that 2028 with current growth rates would be pushed all the way back to the mid-2030s and that the cliff will be so steep. My intuitions are more continuous here. If AGI is close in 2027, I think that will mean increased revenue and continued investment, even if the rate slows down some.
Thanks for these detailed comments! I’ll aim to respond to some of the meat of your post within a few days at the latest, but real quick regarding the top portion:
I find the decision to brand the forecast as “AI 2027” very odd. The authors do not in fact believe this; they explicitly give 2028, 2030, or 2033 for their median dates for a superhuman coder.
The point of this project was presumably to warn about a possible outcome; by the authors’ own beliefs, their warning will be falsified immediately before it is needed.
Adding some more context: each of the timelines forecast authors has a modal superhuman coder year of roughly 2027. The FutureSearch forecasters who have a 2033 median aren’t authors on the scenario itself (but neither is Nikola with the 2028 median). Of the AI 2027 authors, all have a modal year of roughly 2027 and give at least ~20% to getting it by 2027. Daniel, the lead author, has a median of early 2028.
IMO it seems reasonable to portray 2027 as the arrival year of superhuman coders, given the above. It’s not clear whether the median or modal year is better here, conditional on having substantial probability by the modal year (i.e. each of us has >=20% by 2027, Daniel has nearly 50%).
To be transparent though, we originally had it at 2027 because that was Daniel’s median year when we started the project. We decided against changing it when he lengthened his median because (a) it would have been a bunch of work and we’d already spent over a year on the project and (b) as I said above, it seemed roughly as justified as 2028 anyway from an epistemic perspective.
Overall though I sympathize with the concern that we will lose a bunch of credibility if we don’t get superhuman coders by 2027. Seems plausible that we should have lengthened the story despite the reasoning above.
When presenting predictions, forecasters always face tradeoffs regarding how much confidence to present. Confident, precise forecasting attracts attention and motivates action; adding many concrete details produces a compelling story that stimulates discussion, and also yields falsifiable predictions. Emphasizing uncertainty avoids losing credibility when some parts of the story inevitably fail, prevents overconfidence, and encourages more robust strategies that can work across a range of outcomes. But I can’t think of any reason to give a confident, high-precision story that you don’t even believe in!
I’d be curious to hear more about what made you perceive our scenario as confident. We included caveats signaling uncertainty in a bunch of places, for example in “Why is it valuable?” and several expandables and footnotes. Interestingly, this popular YouTuber made a quip that it seemed like we were adding tons of caveats everywhere.
However, the authors of AI 2027 predict pretty radical superintelligence before 2030, which does not seem to be justified by the plot. Arguably, since the plot is focused on software engineering tasks, the most relevant comparison is actually their prediction for human-level software engineers, which I believe is around 2026-2028 (clearly inconsistent with the plot).
Our rationale for why we extend the trend in the way that we do can be found in our timelines forecast. In short, we adjust for (a) the possible trend speedup to a ~4-month doubling time, as in the 2024-2025 trend, (b) the possibility of further superexponentiality, and (c) intermediate speedups from AIs that aren’t yet superhuman coders. Fair if you disagree, but we do explain how we expect things to deviate from the plot you included.
Forecasting time to automated superhuman coders [AI 2027 Timelines Forecast]
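To illustrate the kind of extrapolation described above (with placeholder numbers, not our actual forecast parameters), here is a rough sketch of a time-horizon trend with a ~4-month doubling time plus a simple superexponential adjustment:

```python
# Illustrative sketch only: extrapolate a task-time-horizon trend where each
# successive doubling takes a bit less calendar time than the last.
# All numbers below are placeholders, not the AI 2027 forecast's parameters.

horizon_minutes = 60.0          # assumed starting 80%-success time horizon
doubling_time_months = 4.0      # assumed doubling time, as in the faster 2024-2025 trend
shrink_per_doubling = 0.95      # each doubling takes 5% less time (superexponential tweak)
target_minutes = 60 * 160       # assumed superhuman-coder-level horizon (~1 work-month)

months = 0.0
while horizon_minutes < target_minutes:
    months += doubling_time_months
    horizon_minutes *= 2
    doubling_time_months *= shrink_per_doubling

print(f"Reaches the target horizon after ~{months:.0f} months under these assumptions")
```

The actual forecast also models intermediate AI R&D speedups and uncertainty over all of these parameters, which this toy loop ignores.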
Thanks, this should be fixed now.
My median for superhuman coder is roughly 2030, and yeah for TEDAI roughly 2031. We included our all-things-considered views in a table in the timelines forecast, which are a bit longer than our within-model views.
Ultimately we circulated the benchmarks-and-gaps figure as the primary one because it’s close to our all-things-considered views and we didn’t have time to make a similar figure for our all-things-considered forecast. Perhaps this was a mistake as per @Max Harms’s point of appearing to have faster timelines than we do (though Daniel’s is a bit faster than my benchmarks-and-gaps distribution with a median in early 2028 instead of late 2028).
[Responding to a related point from the OP] An important takeaway from this is that we should expect people to look back on this scenario and think it was too fast (because it didn’t account for [unlikely event that happened anyway]). I don’t know any way around this; it’s largely going to be a result of people not being prediction-literate. Still, best to write down the prediction of the backlash in advance, and I wish AI 2027 had done this more visibly. (It’s tucked away in a few places, such as footnote #1.)
Yeah seems plausible we should have signaled this more strongly, though it may have been tough to do so without undermining our own credibility too much in the eyes of many readers, given the norms around caveats are quite different in non-rationalist spaces. It being footnote 1 is already decently prominent.
AI 2027: What Superintelligence Looks Like
I agree with habryka that the current speedup is probably substantially less than 3x.
However, it’s worth keeping in mind that if it were 3x for engineering, the overall AI progress speedup would be substantially lower, due to (a) non-engineering activities having a lower speedup, (b) compute bottlenecks, and (c) half of the default pace of progress coming from compute.
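As a rough illustration of how those factors combine (my own toy numbers, not a claim about the actual values):

```python
# Toy calculation (illustrative numbers only): how a 3x engineering speedup
# dilutes into a much smaller overall AI-progress speedup.

eng_speedup = 3.0        # assumed speedup on engineering work
non_eng_speedup = 1.3    # assumed (lower) speedup on non-engineering activities (a)
eng_fraction = 0.5       # assumed fraction of researcher time spent on engineering

# Amdahl-style combination over labor: the less-sped-up activities dominate.
labor_speedup = 1.0 / (eng_fraction / eng_speedup + (1 - eng_fraction) / non_eng_speedup)

# (b) compute bottlenecks on experiments would push labor_speedup down further; ignored here.

# (c) if roughly half the default pace of progress comes from compute (not sped up),
# only the labor-driven share of the progress rate gets multiplied.
compute_share = 0.5
overall_speedup = compute_share * 1.0 + (1 - compute_share) * labor_speedup

print(f"labor speedup:   ~{labor_speedup:.2f}x")
print(f"overall speedup: ~{overall_speedup:.2f}x")
```

With these toy numbers, 3x on engineering becomes roughly a 1.4x overall speedup.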
My null hypothesis would be that programmer productivity is increasing exponentially and has been for ~2 years, that this is already being taken into account in the curves, and that without this effect you would see a slower (though imo not massively slower) exponential.
Exponential growth alone doesn’t imply a significant effect here, if the current absolute speedup is low.
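To spell out the arithmetic with made-up numbers: suppose AI assistance multiplies the instantaneous rate of progress by m(t), and m(t) has been growing exponentially but is still small in absolute terms.

```python
import math

# Illustrative only: an exponentially growing but still-small AI speedup barely
# changes the observed yearly growth rate, so removing it wouldn't slow the
# exponential much. All numbers are assumptions for the sake of the example.

base_rate = math.log(3.0)      # assumed underlying log-progress per year without AI
m_start, m_end = 1.05, 1.20    # assumed AI speedup at the start and end of the year

# Average multiplier over the year, assuming m(t) interpolates exponentially.
avg_m = (m_end - m_start) / math.log(m_end / m_start)

with_ai = math.exp(base_rate * avg_m)
without_ai = math.exp(base_rate)
print(f"~{with_ai:.2f}x/year with the AI speedup vs ~{without_ai:.2f}x/year without it")
```

The growth is exponential, but because the multiplier is still near 1, the observed trend only changes modestly.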
I found some prior relevant work and tagged them in https://www.lesswrong.com/tag/successor-alignment. I found the top few comments on https://www.lesswrong.com/posts/axKWaxjc2CHH5gGyN/ai-will-not-want-to-self-improve#comments and https://www.lesswrong.com/posts/wZAa9fHZfR6zxtdNx/agi-systems-and-humans-will-both-need-to-solve-the-alignment#comments helpful.
edit: another effect to keep in mind is that capabilities research may be harder to sandbag on because of more clear metrics.
Wanted to write a more thoughtful reply to this, but basically yes, my best guess is that the benefits of informing the world are in expectation bigger than the negatives from acceleration. A potentially important background view is that I think takeoff speeds matter more than timelines, and it’s unclear to me how having FrontierMath affects takeoff speeds.
I wasn’t thinking much about the optics, but I’d guess that’s not a large effect. I agree that Epoch made a mistake here though and this is a negative.
I could imagine changing my mind somewhat easily.
I feel like I might be missing something, but conditional on scheming, isn’t it differentially useful for safety, because by default scheming AIs would be more likely to sandbag on safety research than capabilities research?
Yes, that answer matches my understanding of the concern. If the vast majority of the dataset were private to Epoch, OpenAI could occasionally submit their solutions (probably via API) to Epoch for grading, but wouldn’t be able to use the dataset with high frequency as an evaluation in many experiments.
This is assuming that companies won’t fish out the data from API logs anyway, which the OP asserts but I think is unclear.
Also, if they have access to the mathematicians’ reasoning in addition to final answers, this could potentially be valuable without directly training on it (e.g. maybe they could use it to evaluate process-based grading approaches).
(FWIW I’m explaining the negatives, but I disagree with the comment I’m expanding on regarding the sign of FrontierMath; it seems positive EV to me despite the concerns.)
Sorry, fixed
Not representative of motivations for all people for all types of evals, but https://www.openphilanthropy.org/rfp-llm-benchmarks/, https://www.lesswrong.com/posts/7qGxm2mgafEbtYHBf/survey-on-the-acceleration-risks-of-our-new-rfps-to-study, https://docs.google.com/document/d/1UwiHYIxgDFnl_ydeuUq0gYOqvzdbNiDpjZ39FEgUAuQ/edit, and some posts in https://www.lesswrong.com/tag/ai-evaluations seem relevant.
Superforecasters can beat domain experts, as shown in Phil Tetlock’s work comparing superforecasters to intelligence analysts.
This isn’t accurate; see this post, especially (3a), (3b), and Goldstein et al. (2015): https://docs.google.com/document/d/1ZEEaVP_HVSwyz8VApYJij5RjEiw3mI7d-j6vWAKaGQ8/edit?tab=t.0#heading=h.mma60cenrfmh
Predict 2025 AI capabilities (by Sunday)
Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I’m intuitively skeptical.
Yes, that would be my guess, medium confidence.
One component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you’re describing.
I’m skeptical of your skepticism. Not knowing basically anything about the CTF scene but using the competitive programming scene as an example, I think the median competitor is much more capable than the median software engineering professional, not less. People like competing at things they’re good at.
I’m not sure I understand your point. I’d guess that if this happened it would be pretty easy for the capabilities researcher to iron out and prevent in the future.