Member of technical staff at METR.
Previously: Vivek Hebbar’s team at MIRI → Adrià Garriga-Alonso on various empirical alignment projects → METR.
I have signed no contracts or agreements whose existence I cannot mention.
There is a decreasing curve of Gemini success probability vs average human time on questions in the benchmark, and the curve intersects 50% at roughly 110 minutes.
Basically it’s trying to measure the same quantity as the original paper (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/) but the numbers are less accurate since we have less data for these benchmarks.
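As a rough illustration of the method (made-up data and a plain logistic regression, not our actual pipeline), the 50% time horizon is where a fitted success-vs-log-time curve crosses 0.5:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up example: average human minutes per question, and whether the model solved it
human_minutes = np.array([2, 5, 10, 30, 60, 120, 240, 480])
model_solved  = np.array([1, 1, 1, 1, 1, 0, 1, 0])

# Fit success probability as a logistic function of log(human time)
X = np.log(human_minutes).reshape(-1, 1)
clf = LogisticRegression(C=1e6).fit(X, model_solved)  # large C ~ no regularization

# Success decreases with time, so the coefficient is negative;
# the 50% point is where intercept + coef * log(t) crosses zero
b0, b1 = clf.intercept_[0], clf.coef_[0, 0]
print(f"Estimated 50% time horizon: {np.exp(-b0 / b1):.0f} minutes")
```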
oops, this was on my work account from which you can’t make public links. Replaced the link with the prompt and beginning of o3 output.
o3 has the same conclusion with a slightly different prompt.
Read this comment exchange and come to a definitive conclusion about whether Garrett Baker is accurately representing Matthew. Focus on content rather than tone:
Conclusion: Garrett is not accurately representing Matthew’s position.
Below is a point‑by‑point comparison that shows where Garrett’s paraphrases diverge from what Matthew is actually claiming (ignoring tone and focusing only on the content).
There was a unit conversion mistake, it should have been 80 minutes. Now fixed.
Besides that, I agree with everything here; these will all be fixed in the final blog post. I already looked at one of the 30m-1h questions; it appeared to be doable in ~3 minutes with the ability to ctrl-f transcripts, but would take longer without transcripts (unknown how much longer).
In the next version I will probably use the no-captions AI numbers and measure myself without captions to get a rough video speed multiplier, then possibly do better stats that separate out domains with strong human-time-dependent difficulty from domains without (like this and SWE-Lancer).
I would love to have Waymo data. It looks like it’s only available since September 2024 so I’ll still need to use Tesla for the earlier period. More critically they don’t publish disengagement data, only crash/injury. There are Waymo claims of things like 1 disengagement every 17,000 miles but I don’t believe them without a precise definition for what this number represents.
We know AI time horizons (human time-to-complete at which a model has a 50% success rate) on software tasks are currently ~1.5hr and doubling every 4-7 months, but what about other domains? Here’s a preliminary result comparing METR’s task suite (orange line) to benchmarks in other domains, all of which have some kind of grounding in human data:
Observations
Time horizons on agentic computer use (OSWorld) are ~100x shorter than in other domains. Domains like Tesla self-driving (tesla_fsd), scientific knowledge (gpqa), math contests (aime), video understanding (video_mme), and software (hcast_r_s) all have roughly similar horizons.
My guess is this means models are good at taking in information from a long context but bad at acting coherently. Most work requires agency in the way OSWorld does, which may be why AIs can’t do the average real-world 1-hour task yet.
There are likely other domains that fall outside this cluster; these are just the five I examined.
Note the original version had a unit conversion error that gave 60x too high horizons for video_mme; this has been fixed (thanks @ryan_greenblatt )
Rate of improvement varies significantly; math contests have improved ~50x in the last year but Tesla self-driving only 6x in 3 years.
HCAST is middle of the pack in both.
Note this is preliminary and uses a new methodology so there might be data issues. I’m currently writing up a full post!
Is this graph believable? What do you want to see analyzed?
edit: fixed Video-MME numbers
I’d guess that a cheaper, wall-mounted version of CleanAirKits/Airfanta would be a better product. It’s just a box with fans and slots for filters: installation labor is significantly lower, the aesthetics are better, and not everyone already has 52-inch ceiling fans at a standardized mounting length, so the market for a standalone device is potentially much larger.
The problem with the ceiling fan is that it’s not designed for static pressure, so its effectiveness at moving air through the filter will depend on contingent factors like the blade area ratio and distance from the ceiling (which determines how much filter area you can fit). 180 CFM @ 33 dB is better than the Coway but only matches a single box with ~7 PC fans on medium; you can do better with a custom-designed ceiling fan, but at that point the fan needs to be part of the product.
What is required for AI to provide net speedup to a software engineering project, when humans write higher-quality code than AIs? It depends on how the AI is used.
Cursor regime
In this regime, similar to how I use Cursor agent mode, the human has to read every line of code the AI generates, so we can write the expected time per unit of code as:

$t_{\text{Cursor}} = t_{\text{AI}} + t_{\text{check}} + r \cdot t_{\text{human}}$

Where

$t_{\text{human}}$ is the time for the human to write the code, either from scratch or after rejecting an AI suggestion

$t_{\text{AI}}$ is the time for the AI to generate the code (determined by its generation speed in tokens per second)

$t_{\text{check}}$ is the time for the human to check the code, accept or reject it, and make any minor revisions, in order to bring the code quality and probability of bugs in line with human-written code

$r$ is the fraction of AI suggestions that are rejected entirely
Note this neglects other factors like code review time, code quality, bugs that aren’t caught by the human, or enabling things the human can’t do.
Autonomous regime
In this regime the AI is reliable enough that the human doesn’t check all the code for bugs, and instead eats the chance of costly bugs entering the codebase:

$t_{\text{autonomous}} = t_{\text{AI}} + p_{\text{bug}} \cdot c_{\text{bug}}$

Where

$p_{\text{bug}}$ is the added probability of a bug from the AI agent compared to a human

$c_{\text{bug}}$ is the expected cost of a bug, including revenue loss, compromising other projects, and time required to fix the bug. This can be higher than $t_{\text{human}}$.
Verifiable regime
If task success is cheaply verifiable e.g. if comprehensive tests are already written or other AIs can verify the code, then bugs are impossible except through reward hacking.
Here $r$ is lower than in the other regimes because the verifier can help the generator AI understand the task, and $t_{\text{AI}}$ can be faster too because you can parallelize.
Although the AI is always faster at writing the code than the human, overall speedup is subject to Amdahl’s law, i.e. limited by the slowest component. If AI generation is only 3x as fast as the human per line of code, speedup will never be faster than 3x. Even when the AI mistake rate is low enough that we move to the autonomous regime, we still have $t_{\text{AI}}$ and $p_{\text{bug}} \cdot c_{\text{bug}}$, both of which can be significant fractions of $t_{\text{human}}$ currently, especially for expert humans who are fast at writing code and projects that are few lines of code. Therefore I expect >5x speedups to overall projects only when (a) AIs write higher-quality code than humans, or (b) tasks are cheaply verifiable and reward hacking is rare, or (c) both $t_{\text{AI}}$ and $t_{\text{check}}$ are much faster than in my current work.
I made a simple Desmos model here.
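As a rough illustration, here is a minimal Python sketch of the Cursor-regime model above; the function name and all parameter values are mine and purely illustrative, not measurements:

```python
# Sketch of the Cursor-regime speedup model; all numbers are illustrative guesses.
def cursor_speedup(t_human: float, t_ai: float, t_check: float, r: float) -> float:
    """Speedup from AI assistance when the human reviews every suggestion.

    t_human: minutes for the human to write the code themselves
    t_ai:    minutes for the AI to generate the code
    t_check: minutes for the human to review/fix the suggestion
    r:       fraction of suggestions rejected (human rewrites from scratch)
    """
    t_assisted = t_ai + t_check + r * t_human
    return t_human / t_assisted

# Even with instant generation and a 0% rejection rate, review time caps the speedup
print(cursor_speedup(t_human=30, t_ai=2, t_check=10, r=0.2))  # ~1.7x
print(cursor_speedup(t_human=30, t_ai=0, t_check=10, r=0.0))  # 3.0x ceiling from checking alone
```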
Why not require model organisms with known ground truth and see if the methods accurately reveal them, like in the “Could a Neuroscientist Understand a Microprocessor?” paper? From the abstract of that paper:
Additionally, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.
This reduces the problem from covering all sources of doubt to making a sufficiently realistic model organism. This was our idea with InterpBench, and I still find it plausible that with better execution one could iterate on it (or the probe, crosscoder, etc. equivalent) and make interpretability progress.
Author here. When constructing this paper, we needed an interpretable metric (time horizon), but this is not very data-efficient. We basically made the fewest tasks we could to get acceptable error bars, because high-quality task creation from scratch is very expensive. (We already spent over $150k baselining tasks, and more on the bounty and baselining infra.) Therefore we should expect that restricting to only 32 of the 170 tasks in the paper makes the error bars much wider; it roughly increases the error bars by a factor of sqrt(170/32) = 2.3.
Now if these tasks were log-uniformly distributed from 1 second to 16 hours, we would still be able to compare these results to the rest of our dataset, but it’s worse than that: all fully_private tasks are too difficult for models before GPT-4, so removing SWAA requires restricting the x axis to the 2 years since GPT-4. This is 1/3 of our x axis range, so it makes error bars on the trendline slope 3 times larger. Combined with the previous effect, the error bars become 6.9 times larger than in the main paper! So the analysis in this post, although it uses higher-quality data, basically throws away all the statistical power of the paper. I wish I had 170 high-quality private tasks but there was simply not enough time to construct them. Likewise, I wish we had baselines for all tasks, but sadly we will probably move to a different safety case framework before we get them.
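To spell out that scaling (a back-of-the-envelope approximation, not the exact bootstrap): each model’s measured horizon gets noisier roughly as $1/\sqrt{\text{number of tasks}}$, giving the $\sqrt{170/32} \approx 2.3$ factor, and the standard error of a fitted slope scales roughly inversely with the spread of the x values, so cutting the date range to 1/3 triples it; together $2.3 \times 3 \approx 6.9$.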
Though SWAA tasks have their own problems (e.g. many of them were written by Claude), they were developed in-house and are private, and it’s unlikely that GPT-2 and GPT-3 would have horizons anything like 10x different on a different benchmark. So for this check we really should not exclude them.
| privacy_level | num_tasks |
| --- | --- |
| fully_private | 32 |
| public_problem | 22 |
| easy_to_memorize | 18 |
| public_solution | 17 |
| semi_private | 2 |
When we include SWAA, the story is not too different from the original paper: doubling roughly 8 months. Note how large the error bars are.
With only fully_private tasks and SWAA and RE-Bench, the extrapolated date of 1-month AI looks similar to the main paper, but the error bars are much larger; this is entirely driven by the small number of tasks (green bar).
For comparison, the main paper extrapolation
When I exclude SWAA and RE-Bench too, the script refuses to even run, because sometimes the time horizon slope is negative, but when I suppress those errors and take the 2024-2025 trend, we get an 80% CI of early 2026 to mid-2047! This is consistent with the main result but pretty uninformative.
You can see that with only fully_private tasks (ignore the tasks under 1 minute; these are SWAA), it’s hard to even tell whether longer tasks are more difficult in the <1 hour range (tiles plot). As such we should be suspicious of whether the time horizon metric works at all with so few tasks.
I ran the horizon length graph with pass@8 instead, and the increase between GPT-4 and o1 seems to be slightly smaller than for pass@1 (could be noise); also, Claude 3.7 does worse than o1. This means the doubling rate for pass@8 may be slightly slower than for pass@1. However, if the horizon length increase since 2023 were only due to RL, the improvement from pass@8 would be barely half as much as the improvement in pass@1. The observed pass@8 improvement is faster than that, which could be due to some combination of the following:
o1 has a better base model than GPT-4
HCAST is an example of “emergent capabilities on long-horizon tasks”, unlike their non-agentic tasks
RL helps more on HCAST skills than math or non-agentic coding
RL helps more on larger models
Some kind of nonlinearity in the relevant metrics
There are various problems with this graph (e.g. to follow the same methodology we should filter for models that are frontier on pass@8, not frontier on pass@1) but this was meant to be a quick and dirty check.
|  | GPT-4 horizon | o1 horizon | Ratio |
| --- | --- | --- | --- |
| Pass@8 | 24 min | 151 min | 6.3x |
| Pass@1 | 5 min | 39 min | 7.8x |
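For reference, the standard unbiased pass@k estimator from the Codex/HumanEval paper, which turns n sampled attempts per task into a pass@8 success probability; this is a generic sketch, not necessarily the exact code used for the table above:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n attempts with c successes.

    Probability that at least one of k sampled attempts succeeds.
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. a task with 3 successes out of 8 attempts
print(pass_at_k(n=8, c=3, k=8))  # 1.0: with k = n, any success counts
print(pass_at_k(n=8, c=1, k=1))  # 0.125
```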
Agree, I’m pretty confused about this discrepancy. I can’t rule out that it’s just the “RL can enable emergent capabilities” point.
The dates used in our regression are the dates models were publicly released, not the dates we benchmarked them. If we use the latter dates, or the dates they were announced, I agree they would be more arbitrary.
Also, there is lots of noise in a time horizon measurement and it only displays any sort of pattern because we measured over many orders of magnitude and years. It’s not very meaningful to extrapolate from just 2 data points; there are many reasons one datapoint could randomly change by a couple of months or factor of 2 in time horizon.
Release schedules could be altered
A model could be overfit to our dataset
One model could play less well with our elicitation/scaffolding
One company could be barely at the frontier, and release a slightly-better model right before the leading company releases a much-better model.
All of these factors are averaged out if you look at more than 2 models. So I prefer to see each model as evidence of whether the trend is accelerating or slowing down over the last 1-2 years, rather than an individual model being very meaningful.
o3 and o4-mini solve more than zero of the >1hr tasks that Claude 3.7 got ~zero on, including some >4hr tasks that no previous models we tested have done well on, so it’s not that models hit a wall at 1-4 hours. My guess is that the tasks they have been trained on are just more similar to HCAST tasks than RE-Bench tasks, though there are other possibilities.
Other metrics also point to a drone-dominated and C&C-dominated war. E.g. towed artillery is too vulnerable to counterbattery fire, and modern mobile artillery like CAESAR must use “shoot and scoot” tactics: it can fire 6 shells within two minutes of stopping and vacate before its last shell lands. But now drones attack them while moving too.
Yes. RL will at least be more applicable to well-defined tasks. Some intuitions:
In my everyday work, the gap between well-defined task ability and ability to work with the METR codebase is growing
4 month doubling time is faster than the rate of progress in most other realistic or unrealistic domains
Recent models really like to reward hack, suggesting that RL can cause some behaviors not relevant to realistic tasks
This trend will break at some point, eg when labs get better at applying RL to realistic tasks, or when RL hits diminishing returns, but I have no idea when
GDM paper: Evaluating the Goal-Directedness of Large Language Models
Tom Everitt, Rohin Shah, and others from GDM attempt to measure “whether LLMs use their capabilities towards their given goal”. Unlike previous work, their measure is not just rescaled task performance—rather, an AI is goal-directed if it uses its capabilities effectively. A model that is not goal-directed when attempting a task will have capabilities but not properly use them. Thus, we can measure goal-directedness by comparing a model’s actual performance to how it should perform if it were using its full capabilities on subtasks.
This kind of measure could in principle be useful to measure:
prerequisites of scheming: maybe a model is not capable of scheming until it reaches some level of goal-directedness, and we can forecast when that happens
elicitation: a normally goal-directed model that is not trying on evals may be underelicited
sandbagging: if an AI unexpectedly has low goal-directedness on evals, this could be a warning sign for sandbagging.
Unfortunately, it’s mostly saturated already—Gemini 2.0 and Claude 3.7 are over 95%. Even GPT-4 gets over 70%.
So until we measure models on tasks where they are currently non-goal-directed, we can’t tell whether future models are more goal-directed than Claude 3.7, precluding use (1). I also feel like scheming will require a qualitatively different kind of goal-directedness more related to instrumental reasoning. This measure seems potentially useful for measuring elicitation and sandbagging, though.
Another limitation is that this can only be measured on tasks that can be cleanly decomposed into subtasks, and whose performance is mathematically predictable from subtask performance. The algorithms they use to estimate it (Appendix C) basically sample performance on every subtask and compute predicted task performance from this.
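To make the idea concrete, here is a toy version of the comparison (my own sketch with made-up numbers, not the paper’s actual estimator from Appendix C): predict full-task performance from measured subtask performance, then score goal-directedness by how close actual performance comes to that prediction.

```python
import numpy as np

# Made-up example: measured success rates on each subtask of a composite task
subtask_success = np.array([0.9, 0.8, 0.95])

# If the task requires every subtask to succeed and the model deployed its full
# subtask-level capability, predicted full-task success is the product
predicted = subtask_success.prod()   # ~0.68

# Observed full-task success rate over sampled attempts
observed = 0.55

# Crude goal-directedness score: fraction of predicted performance actually achieved
goal_directedness = min(observed / predicted, 1.0)
print(f"goal-directedness ~ {goal_directedness:.2f}")   # ~0.80
```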
Time horizon of o3 is ~1.5 hours vs Claude 3.7’s 54 minutes, and it’s statistically significant that it’s above the long-term trend. It’s been less than 2 months since the release of Claude 3.7. If time horizon continues doubling every 3.5 months as it has over the last year, we only have another 12 months until time horizon hits 16 hours and we are unable to measure it with HCAST.
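(The arithmetic behind that 12 months: $\log_2(16\,\text{hr} / 1.5\,\text{hr}) \approx 3.4$ doublings, and $3.4 \times 3.5$ months $\approx 12$ months.)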
My guess is that future model time horizon will double every 3-4 months for well-defined tasks (HCAST, RE-Bench, most automatically scorable tasks) that labs can RL on, while capability on more realistic tasks will follow the long-term 7-month doubling time.
Benchmark Readiness Level
Safety-relevant properties should be ranked on a “Benchmark Readiness Level” (BRL) scale, inspired by NASA’s Technology Readiness Levels. At BRL 4, a benchmark exists; at BRL 6 the benchmark is highly valid; past this point the benchmark becomes increasingly robust against sandbagging. The definitions could look something like this:
| BRL | Definition | Example |
| --- | --- | --- |
| 1 | Theoretical relevance to x-risk defined | Adversarial competence |
| 2 | Property operationalized for frontier AIs and ASIs | AI R&D speedup; Misaligned goals |
| 3 | Behavior (or all parts) observed in artificial settings. Preliminary measurements exist, but may have large methodological flaws. | Reward hacking |
| 4 | Benchmark developed, but may measure different core skills from the ideal measure | Cyber offense (CyBench) |
| 5 | Benchmark measures roughly what we want; superhuman range; methodology is documented and reproducible but may have validity concerns. | Software (HCAST++) |
| 6 | "Production quality" high-validity benchmark. Strongly superhuman range; run on many frontier models; red-teamed for validity; represents all sub-capabilities. Portable implementation. | |
| 7 | Extensive validity checks against downstream properties; reasonable attempts (e.g. fine-tuning) to detect whether AIs are manipulating/sandbagging their scores. | Knowledge (MMLU with fine-tuning) |
| 8 | − | |
| 9 | Benchmark has high validity to real-world contexts beyond lab settings. Can ensure accuracy even when measuring superintelligences with potentially deceptive capabilities. | - |
Here’s a draft list of properties we could apply the BRL to, including capabilities and propensity:
- General capabilities
  - Software ability (HCAST, others)
  - Knowledge (Humanity’s Last Exam, others)
- Sandbagging
  - Situational awareness
  - Alignment faking
  - Sandbagging ability
- Monitorability
  - Steganographic collusion
  - Neuralese
  - Faithful CoT
  - Ability to monitor smarter AIs
- Research acceleration
  - % LoC written by AIs
  - AI R&D Uplift
- Escape risk
  - Adversarial competence
  - Self-exfiltration
  - Cyber offense
- Misalignment
  - Reward hacking
  - Misaligned goals
- Other Dangers
  - CBRN
  - Persuasion
Is this doomed? Am I missing anything important?
I don’t run the evaluations but probably we will; no timeframe yet though as we would need to do elicitation first. Claude’s SWE-bench Verified scores suggest that it will be above 2 hours on the METR task set; the benchmarks are pretty similar apart from their different time annotations.