Member of technical staff at METR.
Previously: Vivek Hebbar’s team at MIRI → Adrià Garriga-Alonso on various empirical alignment projects → METR.
I have signed no contracts or agreements whose existence I cannot mention.
There is a decreasing curve of Gemini success probability vs average human time on questions in the benchmark, and the curve intersects 50% at roughly 110 minutes.
Basically it’s trying to measure the same quantity as the original paper (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/) but the numbers are less accurate since we have less data for these benchmarks.
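As a rough illustration of the method (made-up data and a plain logistic regression, not our actual pipeline), the 50% time horizon is where a fitted success-vs-log-time curve crosses 0.5:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up example: average human minutes per question, and whether the model solved it
human_minutes = np.array([2, 5, 10, 30, 60, 120, 240, 480])
model_solved  = np.array([1, 1, 1, 1, 1, 0, 1, 0])

# Fit success probability as a logistic function of log(human time)
X = np.log(human_minutes).reshape(-1, 1)
clf = LogisticRegression(C=1e6).fit(X, model_solved)  # large C ~ no regularization

# Success decreases with time, so the coefficient is negative;
# the 50% point is where intercept + coef * log(t) crosses zero
b0, b1 = clf.intercept_[0], clf.coef_[0, 0]
print(f"Estimated 50% time horizon: {np.exp(-b0 / b1):.0f} minutes")
```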
oops, this was on my work account from which you can’t make public links. Replaced the link with the prompt and beginning of o3 output.
o3 has the same conclusion with a slightly different prompt.
Read this comment exchange and come to a definitive conclusion about whether Garrett Baker is accurately representing Matthew. Focus on content rather than tone:
Conclusion: Garrett is not accurately representing Matthew’s position.
Below is a point‑by‑point comparison that shows where Garrett’s paraphrases diverge from what Matthew is actually claiming (ignoring tone and focusing only on the content).
There was a unit conversion mistake, it should have been 80 minutes. Now fixed.
Besides that, I agree with everything here; these will all be fixed in the final blog post. I already looked at one of the 30m-1h questions; it appeared to be doable in ~3 minutes with the ability to ctrl-f transcripts, but would take longer without transcripts (unknown how much longer).
In the next version I will probably use the no-captions AI numbers and measure myself without captions to get a rough video speed multiplier, then possibly do better stats that separate out domains with strong human-time-dependent difficulty from domains without (like this and SWE-Lancer).
I would love to have Waymo data. It looks like it’s only available since September 2024 so I’ll still need to use Tesla for the earlier period. More critically they don’t publish disengagement data, only crash/injury. There are Waymo claims of things like 1 disengagement every 17,000 miles but I don’t believe them without a precise definition for what this number represents.
We know AI time horizons (human time-to-complete at which a model has a 50% success rate) on software tasks are currently ~1.5hr and doubling every 4-7 months, but what about other domains? Here’s a preliminary result comparing METR’s task suite (orange line) to benchmarks in other domains, all of which have some kind of grounding in human data:
Observations
Time horizons on agentic computer use (OSWorld) are ~100x shorter than in other domains. Domains like Tesla self-driving (tesla_fsd), scientific knowledge (gpqa), math contests (aime), video understanding (video_mme), and software (hcast_r_s) all have roughly similar horizons.
My guess is this means models are good at taking in information from a long context but bad at acting coherently. Most work requires agency in the way OSWorld does, which may be why AIs can’t do the average real-world 1-hour task yet.
There are likely other domains that fall outside this cluster; these are just the five I examined.
Note the original version had a unit conversion error that gave 60x too high horizons for video_mme; this has been fixed (thanks @ryan_greenblatt )
Rate of improvement varies significantly; math contests have improved ~50x in the last year but Tesla self-driving only 6x in 3 years.
HCAST is middle of the pack in both.
Note this is preliminary and uses a new methodology so there might be data issues. I’m currently writing up a full post!
Is this graph believable? What do you want to see analyzed?
edit: fixed Video-MME numbers
I’d guess that a cheaper, wall-mounted version of CleanAirKits/Airfanta would be a better product. It’s just a box with fans and slots for filters: installation labor is significantly lower, the aesthetics are better, and not everyone already has 52-inch ceiling fans at a standardized mounting length, so the market for a standalone device is potentially much larger.
The problem with the ceiling fan is that it’s not designed for static pressure, so its effectiveness at moving air through the filter will depend on contingent factors like the blade area ratio and distance from the ceiling (which determines how much filter area you can fit). 180 CFM @ 33 dB is better than the Coway but only matches a single box with ~7 PC fans on medium; you can do better with a custom-designed ceiling fan, but at that point the fan needs to be part of the product.
What is required for AI to provide net speedup to a software engineering project, when humans write higher-quality code than AIs? It depends on how the AI is used.
Cursor regime
In this regime, similar to how I use Cursor agent mode, the human has to read every line of code the AI generates, so we can write the expected time per unit of code as:

$t_{\text{Cursor}} = t_{\text{AI}} + t_{\text{check}} + r \cdot t_{\text{human}}$

Where

$t_{\text{human}}$ is the time for the human to write the code, either from scratch or after rejecting an AI suggestion

$t_{\text{AI}}$ is the time for the AI to generate the code (determined by its generation speed in tokens per second)

$t_{\text{check}}$ is the time for the human to check the code, accept or reject it, and make any minor revisions, in order to bring the code quality and probability of bugs in line with human-written code

$r$ is the fraction of AI suggestions that are rejected entirely
Note this neglects other factors like code review time, code quality, bugs that aren’t caught by the human, or enabling things the human can’t do.
Autonomous regime
In this regime the AI is reliable enough that the human doesn’t check all the code for bugs, and instead eats the chance of costly bugs entering the codebase:

$t_{\text{autonomous}} = t_{\text{AI}} + p_{\text{bug}} \cdot c_{\text{bug}}$

Where

$p_{\text{bug}}$ is the added probability of a bug from the AI agent compared to a human

$c_{\text{bug}}$ is the expected cost of a bug, including revenue loss, compromising other projects, and time required to fix the bug. This can be higher than $t_{\text{human}}$.
Verifiable regime
If task success is cheaply verifiable e.g. if comprehensive tests are already written or other AIs can verify the code, then bugs are impossible except through reward hacking.
Here $r$ is lower than in the other regimes because the verifier can help the generator AI understand the task, and $t_{\text{AI}}$ can be faster too because you can parallelize.
Although the AI is always faster at writing the code than the human, overall speedup is subject to Amdahl’s law, i.e. limited by the slowest component. If AI generation is only 3x as fast as the human per line of code, speedup will never be faster than 3x. Even when the AI mistake rate is low enough that we move to the autonomous regime, we still have $t_{\text{AI}}$ and $p_{\text{bug}} \cdot c_{\text{bug}}$, both of which can be significant fractions of $t_{\text{human}}$ currently, especially for expert humans who are fast at writing code and projects that are few lines of code. Therefore I expect >5x speedups to overall projects only when (a) AIs write higher-quality code than humans, or (b) tasks are cheaply verifiable and reward hacking is rare, or (c) both $t_{\text{AI}}$ and $t_{\text{check}}$ are much faster than in my current work.
I made a simple Desmos model here.
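As a rough illustration, here is a minimal Python sketch of the Cursor-regime model above; the function name and all parameter values are mine and purely illustrative, not measurements:

```python
# Sketch of the Cursor-regime speedup model; all numbers are illustrative guesses.
def cursor_speedup(t_human: float, t_ai: float, t_check: float, r: float) -> float:
    """Speedup from AI assistance when the human reviews every suggestion.

    t_human: minutes for the human to write the code themselves
    t_ai:    minutes for the AI to generate the code
    t_check: minutes for the human to review/fix the suggestion
    r:       fraction of suggestions rejected (human rewrites from scratch)
    """
    t_assisted = t_ai + t_check + r * t_human
    return t_human / t_assisted

# Even with instant generation and a 0% rejection rate, review time caps the speedup
print(cursor_speedup(t_human=30, t_ai=2, t_check=10, r=0.2))  # ~1.7x
print(cursor_speedup(t_human=30, t_ai=0, t_check=10, r=0.0))  # 3.0x ceiling from checking alone
```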
Why not require model organisms with known ground truth and see if the methods accurately reveal them, like in the “Could a Neuroscientist Understand a Microprocessor?” paper? From the abstract of that paper:
Additionally, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.
This reduces the problem from covering all sources of doubt to making a sufficiently realistic model organism. This was our idea with InterpBench, and I still find it plausible that with better execution one could iterate on it (or the probe, crosscoder, etc. equivalent) and make interpretability progress.
Author here. When constructing this paper, we needed an interpretable metric (time horizon), but this is not very data-efficient. We basically made the fewest tasks we could to get acceptable error bars, because high-quality task creation from scratch is very expensive. (We already spent over $150k baselining tasks, and more on the bounty and baselining infra.) Therefore we should expect that restricting to only 32 of the 170 tasks in the paper makes the error bars much wider; it roughly increases the error bars by a factor of sqrt(170/32) = 2.3.
Now if these tasks were log-uniformly distributed from 1 second to 16 hours, we would still be able to compare these results to the rest of our dataset, but it’s worse than that: all fully_private tasks are too difficult for models before GPT-4, so removing SWAA requires restricting the x axis to the 2 years since GPT-4. This is 1/3 of our x axis range, so it makes error bars on the trendline slope 3 times larger. Combined with the previous effect, the error bars become 6.9 times larger than in the main paper! So the analysis in this post, although it uses higher-quality data, basically throws away all the statistical power of the paper. I wish I had 170 high-quality private tasks but there was simply not enough time to construct them. Likewise, I wish we had baselines for all tasks, but sadly we will probably move to a different safety case framework before we get them.
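To spell out that scaling (a back-of-the-envelope approximation, not the exact bootstrap): each model’s measured horizon gets noisier roughly as $1/\sqrt{\text{number of tasks}}$, giving the $\sqrt{170/32} \approx 2.3$ factor, and the standard error of a fitted slope scales roughly inversely with the spread of the x values, so cutting the date range to 1/3 triples it; together $2.3 \times 3 \approx 6.9$.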
Though SWAA tasks have their own problems (e.g. many of them were written by Claude), they were developed in-house and are private, and it’s unlikely that GPT-2 and GPT-3 would have horizons anything like 10x different on a different benchmark. So for this check we really should not exclude them.
| privacy_level | num_tasks |
| --- | --- |
| fully_private | 32 |
| public_problem | 22 |
| easy_to_memorize | 18 |
| public_solution | 17 |
| semi_private | 2 |
When we include SWAA, the story is not too different from the original paper: doubling roughly 8 months. Note how large the error bars are.
With only fully_private tasks and SWAA and RE-Bench, the extrapolated date of 1-month AI looks similar to the main paper, but the error bars are much larger; this is entirely driven by the small number of tasks (green bar).
For comparison, the main paper extrapolation
When I exclude SWAA and RE-Bench too, the script refuses to even run, because sometimes the time horizon slope is negative, but when I suppress those errors and take the 2024-2025 trend, we get an 80% CI of early 2026 to mid-2047! This is consistent with the main result but pretty uninformative.
You can see that with only fully_private tasks (ignore the tasks under 1 minute; these are SWAA), it’s hard to even tell whether longer tasks are more difficult in the <1 hour range (tiles plot). As such we should be suspicious of whether the time horizon metric works at all with so few tasks.
I ran the horizon length graph with pass@8 instead, and the increase between GPT-4 and o1 seems to be slightly smaller than for pass@1 (could be noise); also, Claude 3.7 does worse than o1. This means the doubling rate for pass@8 may be slightly slower than for pass@1. However, if the horizon length increase since 2023 were only due to RL, the improvement from pass@8 would be barely half as much as the improvement in pass@1. The observed pass@8 improvement is faster than that, which could be due to some combination of the following:
o1 has a better base model than GPT-4
HCAST is an example of “emergent capabilities on long-horizon tasks”, unlike their non-agentic tasks
RL helps more on HCAST skills than math or non-agentic coding
RL helps more on larger models
Some kind of nonlinearity in the relevant metrics
There are various problems with this graph (e.g. to follow the same methodology we should filter for models that are frontier on pass@8, not frontier on pass@1) but this was meant to be a quick and dirty check.
|  | GPT-4 horizon | o1 horizon | Ratio |
| --- | --- | --- | --- |
| Pass@8 | 24 min | 151 min | 6.3x |
| Pass@1 | 5 min | 39 min | 7.8x |
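For reference, the standard unbiased pass@k estimator from the Codex/HumanEval paper, which turns n sampled attempts per task into a pass@8 success probability; this is a generic sketch, not necessarily the exact code used for the table above:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n attempts with c successes.

    Probability that at least one of k sampled attempts succeeds.
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. a task with 3 successes out of 8 attempts
print(pass_at_k(n=8, c=3, k=8))  # 1.0: with k = n, any success counts
print(pass_at_k(n=8, c=1, k=1))  # 0.125
```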
Agree, I’m pretty confused about this discrepancy. I can’t rule out that it’s just the “RL can enable emergent capabilities” point.
The dates used in our regression are the dates models were publicly released, not the dates we benchmarked them. If we use the latter dates, or the dates they were announced, I agree they would be more arbitrary.
Also, there is lots of noise in a time horizon measurement and it only displays any sort of pattern because we measured over many orders of magnitude and years. It’s not very meaningful to extrapolate from just 2 data points; there are many reasons one datapoint could randomly change by a couple of months or factor of 2 in time horizon.
Release schedules could be altered
A model could be overfit to our dataset
One model could play less well with our elicitation/scaffolding
One company could be barely at the frontier, and release a slightly-better model right before the leading company releases a much-better model.
All of these factors are averaged out if you look at more than 2 models. So I prefer to see each model as evidence of whether the trend is accelerating or slowing down over the last 1-2 years, rather than an individual model being very meaningful.
o3 and o4-mini solve more than zero of the >1hr tasks that Claude 3.7 got ~zero on, including some >4hr tasks that no previous models we tested have done well on, so it’s not that models hit a wall at 1-4 hours. My guess is that the tasks they have been trained on are just more similar to HCAST tasks than RE-Bench tasks, though there are other possibilities.
Other metrics also point to a drone-dominated and C&C-dominated war. E.g. towed artillery is too vulnerable to counterbattery fire, and modern mobile artillery like CAESAR must use “shoot and scoot” tactics: it can fire 6 shells within two minutes of stopping and vacate before its last shell lands. But now drones attack them while moving too.
Yes. RL will at least be more applicable to well-defined tasks. Some intuitions:
In my everyday work, the gap between well-defined task ability and ability to work with the METR codebase is growing
4 month doubling time is faster than the rate of progress in most other realistic or unrealistic domains
Recent models really like to reward hack, suggesting that RL can cause some behaviors not relevant to realistic tasks
This trend will break at some point, eg when labs get better at applying RL to realistic tasks, or when RL hits diminishing returns, but I have no idea when
GDM paper: Evaluating the Goal-Directedness of Large Language Models
Tom Everitt, Rohin Shah, and others from GDM attempt to measure “whether LLMs use their capabilities towards their given goal”. Unlike previous work, their measure is not just rescaled task performance—rather, an AI is goal-directed if it uses its capabilities effectively. A model that is not goal-directed when attempting a task will have capabilities but not properly use them. Thus, we can measure goal-directedness by comparing a model’s actual performance to how it should perform if it were using its full capabilities on subtasks.
This kind of measure could in principle be useful to measure:
prerequisites of scheming: maybe a model is not capable of scheming until it reaches some level of goal-directedness, and we can forecast when that happens
elicitation: a normally goal-directed model that is not trying on evals may be underelicited
sandbagging: if an AI unexpectedly has low goal-directedness on evals, this could be a warning sign for sandbagging.
Unfortunately, it’s mostly saturated already—Gemini 2.0 and Claude 3.7 are over 95%. Even GPT-4 gets over 70%.
So until we measure models on tasks where they are currently non-goal-directed, we can’t tell whether future models are more goal-directed than Claude 3.7, precluding use (1). I also feel like scheming will require a qualitatively different kind of goal-directedness more related to instrumental reasoning. This measure seems potentially useful for measuring elicitation and sandbagging, though.
Another limitation is that this can only be measured on tasks that can be cleanly decomposed into subtasks, and whose performance is mathematically predictable from subtask performance. The algorithms they use to estimate it (Appendix C) basically sample performance on every subtask and compute predicted task performance from this.
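To make the idea concrete, here is a toy version of the comparison (my own sketch with made-up numbers, not the paper’s actual estimator from Appendix C): predict full-task performance from measured subtask performance, then score goal-directedness by how close actual performance comes to that prediction.

```python
import numpy as np

# Made-up example: measured success rates on each subtask of a composite task
subtask_success = np.array([0.9, 0.8, 0.95])

# If the task requires every subtask to succeed and the model deployed its full
# subtask-level capability, predicted full-task success is the product
predicted = subtask_success.prod()   # ~0.68

# Observed full-task success rate over sampled attempts
observed = 0.55

# Crude goal-directedness score: fraction of predicted performance actually achieved
goal_directedness = min(observed / predicted, 1.0)
print(f"goal-directedness ~ {goal_directedness:.2f}")   # ~0.80
```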
Time horizon of o3 is ~1.5 hours vs Claude 3.7’s 54 minutes, and it’s statistically significant that it’s above the long-term trend. It’s been less than 2 months since the release of Claude 3.7. If time horizon continues doubling every 3.5 months as it has over the last year, we only have another 12 months until time horizon hits 16 hours and we are unable to measure it with HCAST.
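(The arithmetic behind that 12 months: $\log_2(16\,\text{hr} / 1.5\,\text{hr}) \approx 3.4$ doublings, and $3.4 \times 3.5$ months $\approx 12$ months.)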
My guess is that future model time horizon will double every 3-4 months for well-defined tasks (HCAST, RE-Bench, most automatically scorable tasks) that labs can RL on, while capability on more realistic tasks will follow the long-term 7-month doubling time.
Benchmark Readiness Level
Safety-relevant properties should be ranked on a “Benchmark Readiness Level” (BRL) scale, inspired by NASA’s Technology Readiness Levels. At BRL 4, a benchmark exists; at BRL 6 the benchmark is highly valid; past this point the benchmark becomes increasingly robust against sandbagging. The definitions could look something like this:
| BRL | Definition | Example |
| --- | --- | --- |
| 1 | Theoretical relevance to x-risk defined | Adversarial competence |
| 2 | Property operationalized for frontier AIs and ASIs | AI R&D speedup; Misaligned goals |
| 3 | Behavior (or all parts) observed in artificial settings. Preliminary measurements exist, but may have large methodological flaws. | Reward hacking |
| 4 | Benchmark developed, but may measure different core skills from the ideal measure | Cyber offense (CyBench) |
| 5 | Benchmark measures roughly what we want; superhuman range; methodology is documented and reproducible but may have validity concerns. | Software (HCAST++) |
| 6 | "Production quality" high-validity benchmark. Strongly superhuman range; run on many frontier models; red-teamed for validity; represents all sub-capabilities. Portable implementation. | |
| 7 | Extensive validity checks against downstream properties; reasonable attempts (e.g. fine-tuning) to detect whether AIs are manipulating/sandbagging their scores. | Knowledge (MMLU with fine-tuning) |
| 8 | − | |
| 9 | Benchmark has high validity to real-world contexts beyond lab settings. Can ensure accuracy even when measuring superintelligences with potentially deceptive capabilities. | - |
Here’s a draft list of properties we could apply the BRL to, including capabilities and propensity:
- General capabilities
  - Software ability (HCAST, others)
  - Knowledge (Humanity’s Last Exam, others)
- Sandbagging
  - Situational awareness
  - Alignment faking
  - Sandbagging ability
- Monitorability
  - Steganographic collusion
  - Neuralese
  - Faithful CoT
  - Ability to monitor smarter AIs
- Research acceleration
  - % LoC written by AIs
  - AI R&D Uplift
- Escape risk
  - Adversarial competence
  - Self-exfiltration
  - Cyber offense
- Misalignment
  - Reward hacking
  - Misaligned goals
- Other Dangers
  - CBRN
  - Persuasion
Is this doomed? Am I missing anything important?
I don’t run the evaluations but probably we will; no timeframe yet though as we would need to do elicitation first. Claude’s SWE-bench Verified scores suggest that it will be above 2 hours on the METR task set; the benchmarks are pretty similar apart from their different time annotations.