Engineer at METR.
Previously: Vivek Hebbar’s team at MIRI → Adrià Garriga-Alonso on various empirical alignment projects → METR.
I have signed no contracts or agreements whose existence I cannot mention.
What is required for AI to provide net speedup to a software engineering project, when humans write higher quality code than AIs? It depends on how the AI is used.
Cursor regime
In this regime, similar to how I use Cursor agent mode, the human has to read every line of code the AI generates, so we can write:

$$\text{speedup} = \frac{T_{\text{write}}}{T_{\text{gen}} + T_{\text{check}} + r \cdot T_{\text{write}}}$$

Where

$T_{\text{write}}$ is the time for the human to write the code, either from scratch or after rejecting an AI suggestion
$T_{\text{gen}}$ is the time for the AI to generate the code, determined by its speed in tokens per second
$T_{\text{check}}$ is the time for the human to check the code, accept or reject it, and make any minor revisions, in order to bring the code quality and probability of bugs equal with human-written code
$r$ is the fraction of AI suggestions that are rejected entirely.
Note this neglects other factors like code review time, code quality, bugs that aren’t caught by the human, or enabling things the human can’t do.
Autonomous regime
In this regime the AI is reliable enough that the human doesn’t check all the code for bugs, and instead eats the chance of costly bugs entering the codebase:

$$\text{speedup} = \frac{T_{\text{write}}}{T_{\text{gen}} + p_{\text{bug}} \cdot C_{\text{bug}}}$$

Where

$p_{\text{bug}}$ is the added probability of a bug from the AI agent compared to a human
$C_{\text{bug}}$ is the expected cost of a bug, including revenue loss, compromising other projects, and time required to fix the bug. This can be higher than $T_{\text{write}}$.
Verifiable regime
If task success is cheaply verifiable e.g. if comprehensive tests are already written or other AIs can verify the code, then bugs are impossible except through reward hacking.
Here $r$ is lower than in the other regimes because the verifier can help the generator AI understand the task, and generation can be faster too because you can parallelize.
Although the AI is always faster at writing the code than the human, overall speedup is subject to Amdahl’s law, i.e. limited by the slowest component. If AI generation is only 3x as fast as the human per line of code, speedup will never be more than 3x. Even when the AI mistake rate is low enough that we move to the autonomous regime, we still have $T_{\text{gen}}$ and $p_{\text{bug}} \cdot C_{\text{bug}}$, both of which can be significant fractions of $T_{\text{write}}$ currently, especially for expert humans who are fast at writing code and projects that are few lines of code. Therefore I expect >5x speedups to overall projects only when (a) AIs write higher-quality code than humans, or (b) tasks are cheaply verifiable and reward hacking is rare, or (c) both $T_{\text{gen}}$ and $T_{\text{check}}$ are much faster than in my current work.
I made a simple Desmos model here.
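For readers who prefer code to Desmos, here is a minimal Python sketch of the same model; the numbers are placeholders, not measurements:

```python
def cursor_regime_speedup(t_write, t_gen, t_check, reject_frac):
    """Speedup = human-only time / AI-assisted time.

    AI-assisted time per unit of code: generation + human checking
    + (rejected fraction rewritten by hand from scratch).
    """
    t_ai_assisted = t_gen + t_check + reject_frac * t_write
    return t_write / t_ai_assisted


def autonomous_regime_speedup(t_write, t_gen, p_bug, cost_bug):
    """Same idea, but the human no longer checks every line and instead
    eats the expected cost of extra bugs (expressed in time units)."""
    t_ai_assisted = t_gen + p_bug * cost_bug
    return t_write / t_ai_assisted


# Example: AI generates 3x faster than the human writes, checking takes 40%
# of writing time, and 20% of suggestions are rejected outright.
print(cursor_regime_speedup(t_write=10, t_gen=10 / 3, t_check=4, reject_frac=0.2))
# ~1.07x: even with fast generation, checking and rejections dominate,
# which is the Amdahl's-law point made above.
```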
Why not require model organisms with known ground truth and see if the methods accurately reveal them, like in the paper? From the abstract of that paper:
Additionally, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.
This reduces the problem from covering all sources of doubt to making a sufficiently realistic model organism. This was our idea with InterpBench, and I still find it plausible that with better execution one could iterate on it (or the probe, crosscoder, etc. equivalent) and make interpretability progress.
Author here. When constructing this paper, we needed an interpretable metric (time horizon), but this is not very data-efficient. We basically made the fewest tasks we could to get acceptable error bars, because high-quality task creation from scratch is very expensive. (We already spent over $150k baselining tasks, and more on the bounty and baselining infra.) Therefore we should expect that restricting to only 32 of the 170 tasks in the paper makes the error bars much wider; it roughly increases the error bars by a factor of sqrt(170/32) = 2.3.
Now if these tasks were log-uniformly distributed from 1 second to 16 hours, we would still be able to compare these results to the rest of our dataset, but it’s worse than that: all fully_private tasks are too difficult for models before GPT-4, so removing SWAA requires restricting the x axis to the 2 years since GPT-4. This is 1⁄3 of our x axis range, so it makes error bars on the trendline slope 3 times larger. Combined with the previous effect the error bars become 6.9 times larger than the main paper! So the analysis in this post, although it uses higher-quality data, basically throws away all the statistical power of the paper. I wish I had 170 high-quality private tasks but there was simply not enough time to construct them. Likewise, I wish we had baselines for all tasks, but sadly we will probably move to a different safety case framework before we get them.
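Spelling out that arithmetic as a quick check:

```python
import math

n_total, n_fully_private = 170, 32
task_factor = math.sqrt(n_total / n_fully_private)  # ~2.3x wider from fewer tasks
xaxis_factor = 3                                     # x-axis range shrinks to ~1/3
print(task_factor, task_factor * xaxis_factor)       # ~2.3, ~6.9
```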
Though SWAA tasks have their own problems (eg many of them were written by Claude), they were developed in-house and are private, and it’s unlikely that GPT-2 and GPT-3 would have a horizon like 10x different on a different benchmark. So for this check we really should not exclude them.
| privacy_level | num_tasks |
|---|---|
| fully_private | 32 |
| public_problem | 22 |
| easy_to_memorize | 18 |
| public_solution | 17 |
| semi_private | 2 |
When we include SWAA, the story is not too different from the original paper: doubling roughly 8 months. Note how large the error bars are.
With only fully_private tasks and SWAA and RE-Bench, the extrapolated date of 1-month AI looks similar to the main paper, but the error bars are much larger; this is entirely driven by the small number of tasks (green bar).
For comparison, here is the main paper extrapolation:
When I exclude SWAA and RE-Bench too, the script refuses to even run, because sometimes the time horizon slope is negative, but when I suppress those errors and take the 2024-2025 trend we get 80% CI of early 2026 to mid-2047! This is consistent with the main result but pretty uninformative.
You can see that with only fully_private tasks (ignore the tasks under 1 minute; these are SWAA), it’s hard to even tell whether longer tasks are more difficult in the <1 hour range (tiles plot). As such we should be suspicious of whether the time horizon metric works at all with so few tasks.
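For intuition, the kind of fit behind a 50% time horizon looks roughly like this (a simplified sketch, not our exact code): regress success against log task length and find where predicted success crosses 50%.

```python
# Rough sketch (not the paper's exact methodology): fit success vs. log2(task
# length) and find the length at which predicted success crosses 50%.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fifty_percent_horizon(task_minutes, successes):
    """task_minutes: human completion time per run; successes: 0/1 per run."""
    X = np.log2(np.asarray(task_minutes, dtype=float)).reshape(-1, 1)
    clf = LogisticRegression().fit(X, successes)
    b0, b1 = clf.intercept_[0], clf.coef_[0][0]
    # Predicted success is 50% where b0 + b1 * log2(t) = 0.
    return 2 ** (-b0 / b1)

# With only ~32 tasks the fitted slope b1 is noisy; if it comes out >= 0
# (longer tasks not harder for the model), the 50% crossing is meaningless.
```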
I ran the horizon length graph with pass@8 instead, and the increase between GPT-4 and o1 seems to be slightly smaller (could be noise), and also Claude 3.7 does worse than o1. This means the doubling rate for pass@8 may be slightly slower than for pass@1. However, if the horizon length increase since 2023 were only due to RL, the improvement from pass@8 would be barely half as much as the improvement in pass@1. The observed pass@8 improvement is larger than that, which could be due to some combination of the following:
o1 has a better base model than GPT-4
HCAST is an example of “emergent capabilities on long-horizon tasks”, unlike their non-agentic tasks
RL helps more on HCAST skills than math or non-agentic coding
RL helps more on larger models
Some kind of nonlinearity in the relevant metrics
There are various problems with this graph (e.g. to follow the same methodology we should filter for models that are frontier on pass@8, not frontier on pass@1) but this was meant to be a quick and dirty check.
| | GPT-4 horizon | o1 horizon | Ratio |
|---|---|---|---|
| Pass@8 | 24 min | 151 min | 6.3x |
| Pass@1 | 5 min | 39 min | 7.8x |
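For anyone who wants to reproduce something like this, the standard unbiased pass@k estimator from the Codex paper is one way to turn n runs per task into pass@k scores (shown here for illustration, not necessarily the exact computation behind the table):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples succeeds,
    given c successes observed in n runs (Chen et al. 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=8, c=3, k=8))  # 1.0   (some run succeeded, so pass@8 = 1)
print(pass_at_k(n=8, c=3, k=1))  # 0.375 (the corresponding pass@1 estimate)
```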
Agree, I’m pretty confused about this discrepancy. I can’t rule out that it’s just the “RL can enable emergent capabilities” point.
The dates used in our regression are the dates models were publicly released, not the dates we benchmarked them. If we use the latter dates, or the dates they were announced, I agree they would be more arbitrary.
Also, there is lots of noise in a time horizon measurement and it only displays any sort of pattern because we measured over many orders of magnitude and years. It’s not very meaningful to extrapolate from just 2 data points; there are many reasons one datapoint could randomly change by a couple of months or factor of 2 in time horizon.
Release schedules could be altered
A model could be overfit to our dataset
One model could play less well with our elicitation/scaffolding
One company could be barely at the frontier, and release a slightly-better model right before the leading company releases a much-better model.
All of these factors are averaged out if you look at more than 2 models. So I prefer to see each model as evidence of whether the trend is accelerating or slowing down over the last 1-2 years, rather than an individual model being very meaningful.
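To see why, the doubling time comes from a log-linear fit across all the models, so any single model only nudges the slope. A toy version (all dates and horizons here are made up):

```python
import numpy as np

# Hypothetical release dates (in years) and 50% time horizons (in minutes).
release_years = np.array([2023.2, 2023.8, 2024.3, 2024.9, 2025.3])
horizons_min = np.array([5.0, 12.0, 18.0, 45.0, 80.0])

# Fit log2(horizon) vs. date; the slope is doublings per year.
slope, intercept = np.polyfit(release_years, np.log2(horizons_min), 1)
print(f"doubling time ~ {12 / slope:.1f} months")
```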
o3 and o4-mini solve more than zero of the >1hr tasks that Claude 3.7 got ~zero on, including some >4hr tasks that no previous models we tested have done well on, so it’s not that models hit a wall at 1-4 hours. My guess is that the tasks they have been trained on are just more similar to HCAST tasks than RE-Bench tasks, though there are other possibilities.
Other metrics also point to a drone-dominated and C&C-dominated war. E.g. towed artillery is too vulnerable to counterbattery fire, and modern mobile artillery like CAESAR must use “shoot and scoot” tactics: it can fire 6 shells within two minutes of stopping and vacate before its last shell lands. But now drones attack them while moving too.
Yes. RL will at least be more applicable to well-defined tasks. Some intuitions:
In my everyday work, the gap between ability on well-defined tasks and ability to work with the METR codebase is growing
A 4-month doubling time is faster than the rate of progress in most other realistic or unrealistic domains
Recent models really like to reward hack, suggesting that RL can cause some behaviors not relevant to realistic tasks
This trend will break at some point, eg when labs get better at applying RL to realistic tasks, or when RL hits diminishing returns, but I have no idea when
GDM paper: Evaluating the Goal-Directedness of Large Language Models
Tom Everitt, Rohin Shah, and others from GDM attempt to measure “whether LLMs use their capabilities towards their given goal”. Unlike previous work, their measure is not just rescaled task performance—rather, an AI is goal-directed if it uses its capabilities effectively. A model that is not goal-directed when attempting a task will have capabilities but not properly use them. Thus, we can measure goal-directedness by comparing a model’s actual performance to how it should perform if it were using its full capabilities on subtasks.
This kind of measure could in principle be useful to measure:
1. Prerequisites of scheming: maybe a model is not capable of scheming until it reaches some level of goal-directedness, and we can forecast when that happens
2. Elicitation: a normally goal-directed model that is not trying on evals may be underelicited
3. Sandbagging: if an AI unexpectedly has low goal-directedness on evals, this could be a warning sign for sandbagging.
Unfortunately, it’s mostly saturated already—Gemini 2.0 and Claude 3.7 are over 95%. Even GPT-4 gets over 70%.
So until we measure models on tasks where they are currently non-goal-directed, we can’t tell whether future models are more goal-directed than Claude 3.7, precluding use (1). I also feel like scheming will require a qualitatively different kind of goal-directedness more related to instrumental reasoning. This measure seems potentially useful for measuring elicitation and sandbagging, though.
Another limitation is that this can only be measured on tasks that can be cleanly decomposed into subtasks, and whose performance is mathematically predictable from subtask performance. The algorithms they use to estimate (Appendix C) basically sample performance on every subtask and compute task performance from this.
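As a toy illustration of the comparison (not the paper’s actual algorithm): predict full-task performance from measured subtask success rates under some composition rule, then compare with observed full-task performance.

```python
import numpy as np

# Toy version: assume the task succeeds only if every subtask succeeds
# (one possible composition rule; Appendix C of the paper is more involved).
subtask_success_rates = [0.95, 0.90, 0.85]            # measured in isolation
predicted_task_rate = np.prod(subtask_success_rates)  # ~0.73

observed_task_rate = 0.60                             # measured on the full task
print(observed_task_rate / predicted_task_rate)       # ~0.83 "goal-directedness"
```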
Time horizon of o3 is ~1.5 hours vs Claude 3.7’s 54 minutes, and it’s statistically significant that it’s above the long-term trend. It’s been less than 2 months since the release of Claude 3.7. If time horizon continues doubling every 3.5 months as it has over the last year, we only have another 12 months until time horizon hits 16 hours and we are unable to measure it with HCAST.
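The 12-month figure is just doubling arithmetic:

```python
import math

current_horizon_hours = 1.5
target_horizon_hours = 16
doubling_time_months = 3.5

doublings = math.log2(target_horizon_hours / current_horizon_hours)  # ~3.4
print(doublings * doubling_time_months)                              # ~12 months
```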
My guess is that future model time horizon will double every 3-4 months for well-defined tasks (HCAST, RE-Bench, most automatically scorable tasks) that labs can RL on, while capability on more realistic tasks will follow the long-term 7-month doubling time.
Benchmark Readiness Level
Safety-relevant properties should be ranked on a “Benchmark Readiness Level” (BRL) scale, inspired by NASA’s Technology Readiness Levels. At BRL 4, a benchmark exists; at BRL 6 the benchmark is highly valid; past this point the benchmark becomes increasingly robust against sandbagging. The definitions could look something like this:
| BRL | Definition | Example |
|---|---|---|
| 1 | Theoretical relevance to x-risk defined | Adversarial competence |
| 2 | Property operationalized for frontier AIs and ASIs | AI R&D speedup; Misaligned goals |
| 3 | Behavior (or all parts) observed in artificial settings. Preliminary measurements exist, but may have large methodological flaws. | Reward hacking |
| 4 | Benchmark developed, but may measure different core skills from the ideal measure | Cyber offense (CyBench) |
| 5 | Benchmark measures roughly what we want; superhuman range; methodology is documented and reproducible but may have validity concerns. | Software (HCAST++) |
| 6 | "Production quality" high-validity benchmark. Strongly superhuman range; run on many frontier models; red-teamed for validity; represents all sub-capabilities. Portable implementation. | |
| 7 | Extensive validity checks against downstream properties; reasonable attempts (e.g. fine-tuning) to detect whether AIs are manipulating/sandbagging their scores. | Knowledge (MMLU with fine-tuning) |
| 8 | − | |
| 9 | Benchmark has high validity to real-world contexts beyond lab settings. Can ensure accuracy even when measuring superintelligences with potentially deceptive capabilities. | - |
Here’s a draft list of properties we could apply the BRL to, including capabilities and propensity:
- General capabilities
  - Software ability (HCAST, others)
  - Knowledge (Humanity’s Last Exam, others)
- Sandbagging
  - Situational awareness
  - Alignment faking
  - Sandbagging ability
- Monitorability
  - Steganographic collusion
  - Neuralese
  - Faithful CoT
  - Ability to monitor smarter AIs
- Research acceleration
  - % LoC written by AIs
  - AI R&D Uplift
- Escape risk
  - Adversarial competence
  - Self-exfiltration
  - Cyber offense
- Misalignment
  - Reward hacking
  - Misaligned goals
- Other Dangers
  - CBRN
  - Persuasion
Is this doomed? Am I missing anything important?
Some versions of the METR time horizon paper from alternate universes:
Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh)
Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the exception of France. Over the last 6 years, the land area that AI can successfully take over with 50% success rate has increased from 0 to 0 km^2, at the rate of 0 km^2 per year (95% CI 0.0-0.0 km^2/year); extrapolation suggests that AI world takeover is unlikely to occur in the near future. To address concerns about the narrowness of our distribution, we also study AI ability to take over small planets and asteroids, and find similar trends.
When Will Worrying About AI Be Automated?
Abstract: Since 2019, the amount of time LW has spent worrying about AI has doubled every seven months, and now constitutes the primary bottleneck to AI safety research. Automation of worrying would be transformative to the research landscape, but worrying includes several complex behaviors, ranging from simple fretting to concern, anxiety, perseveration, and existential dread, and so is difficult to measure. We benchmark the ability of frontier AIs to worry about common topics like disease, romantic rejection, and job security, and find that current frontier models such as Claude 3.7 Sonnet already outperform top humans, especially in existential dread. If these results generalize to worrying about AI risk, AI systems will be capable of autonomously worrying about their own capabilities by the end of this year, allowing us to outsource all our AI concerns to the systems themselves.
Estimating Time Since The Singularity
Early work on the time horizon paper used a hyperbolic fit, which predicted that AGI (AI with an infinite time horizon) was reached last Thursday. [1] We were skeptical at first because the R^2 was extremely low, but recent analysis by Epoch suggested that AI already outperformed humans at a 100-year time horizon by about 2016. We have no choice but to infer that the Singularity has already happened, and therefore the world around us is a simulation. We construct a Monte Carlo estimate over dates since the Singularity and simulator intentions, and find that the simulation will likely be turned off in the next three to six months.
[1]: This is true
Quick list of reasons for me:
I’m averse to attending mass protests myself because they make it harder to think clearly and I usually don’t agree with everything any given movement stands for.
Under my worldview, an unconditional pause is a much harder ask than is required to save most worlds if p(doom) is 14% (the number stated on the website). It seems highly impractical to implement compared to more common regulatory frameworks and is also super unaesthetic because I am generally pro-progress.
The economic and political landscape around AI is complicated enough that agreeing with their stated goals is not enough; you need to agree with their theory of change.
Broad public movements require making alliances which can be harmful in the long term. Environmentalism turned anti-nuclear, a decades-long mistake which has accelerated climate change by years. PauseAI wants to include people who oppose AI on its present dangers, which makes me uneasy. What if the landscape changes such that the best course of action is contrary to PauseAI’s current goals?
I think PauseAI’s theory of change is weak
From reading the website, they want to leverage protests, volunteer lobbying, and informing the public into an international treaty banning superhuman AI and a unilateral supply-chain pause. It seems hard for the general public to have significant influence over this kind of issue unless AI rises to the top issue for most Americans, since the current top issue is improving the economy, which directly conflicts with a pause.
There are better theories of change
Strengthening RSPs into industry standards, then regulations.
Directly informing elites about the dangers of AI, rather than the general public.
History (e.g. the civil rights movement) shows that when moderates do not publicly endorse radicals, a positive radical flank effect can result, making the moderates’ goals easier to achieve.
I basically agree with this. The reason the paper didn’t include this kind of reasoning (only a paragraph about how AGI will have infinite horizon length) is that we felt making a forecast based on a superexponential trend would be too speculative for an academic paper. (There is really no way to make one without heavy reliance on priors; does it speed up by 10% per doubling or 20%?) It also wasn’t necessary, given that the 2027 and 2029-2030 dates for 1-month AI derived from extrapolation already roughly bracketed our uncertainty.
External validity is a huge concern, so we don’t claim anything as ambitious as average knowledge worker tasks. In one sentence, my opinion is that our task suite is fairly representative of well-defined, low-context, measurable software tasks that can be done without a GUI. More speculatively, horizons on this are probably within a large (~10x) constant factor of horizons on most other software tasks. We have a lot more discussion of this in the paper, especially in heading 7.2.1 “Systematic differences between our tasks and real tasks”. The HCAST paper also has a better description of the dataset.
We didn’t try to make the dataset a perfectly stratified sample of tasks meeting that description, but there is enough diversity in the dataset that I’m much more concerned about relevance of HCAST-like tasks to real life than relevance of HCAST to the universe of HCAST-like tasks.
Humans don’t need 10x more memory per step nor 100x more compute to do a 10-year project than a 1-year project, so this is proof it isn’t a hard constraint. It might need an architecture change but if the Gods of Straight Lines control the trend, AI companies will invent it as part of normal algorithmic progress and we will remain on an exponential / superexponential trend.
Regarding 1 and 2, I basically agree that SWAA doesn’t provide much independent signal. The reason we made SWAA was that models before GPT-4 got ~0% on HCAST, so we needed shorter tasks to measure their time horizon. 3 is definitely a concern and we’re currently collecting data on open-source PRs to get a more representative sample of long tasks.
That bit at the end about “time horizon of our average baseliner” is a little confusing to me, but I understand it to mean “if we used the 50% reliability metric on the humans we had do these tasks, our model would say humans can’t reliably perform tasks that take longer than an hour”. Which is a pretty interesting point.
That’s basically correct. To give a little more context for why we don’t really believe this number, during data collection we were not really trying to measure the human success rate, just get successful human runs and measure their time. It was very common for baseliners to realize that finishing the task would take too long, give up, and try to collect speed bonuses on other tasks. This is somewhat concerning for biasing the human time-to-complete estimates, but much more concerning for this human time horizon measurement. So we don’t claim the human time horizon as a result.
I’d guess that a cheaper, wall-mounted version of CleanAirKits/Airfanta would be a better product. It’s just a box with fans and slots for filters, the installation labor is significantly lower, you get better aesthetics, and not everyone has 52 inch ceiling fans at a standardized mounting length already so the market is potentially much larger with a standalone device.
The problem with the ceiling fan is that it’s not designed for static pressure, so its effectiveness at moving air through the filter will depend on contingent factors like the blade area ratio and distance from the ceiling (which determines how much filter area you can fit). 180 CFM @ 33 dB is better than the Coway but only matches a single box with ~7 PC fans on medium; you can do better with a custom designed ceiling fan but at that point the fan needs to be part of the product.