Safety-relevant properties should be ranked on a “Benchmark Readiness Level” (BRL) scale, inspired by NASA’s Technology Readiness Levels. At BRL 4, a benchmark exists; at BRL 6 the benchmark is highly valid; past this point the benchmark becomes increasingly robust against sandbagging. The definitions could look something like this:
| BRL | Definition | Example |
|---|---|---|
| 1 | Theoretical relevance to x-risk defined | Adversarial competence |
| 2 | Property operationalized for frontier AIs and ASIs | AI R&D speedup; misaligned goals |
| 3 | Behavior (or all parts) observed in artificial settings. Preliminary measurements exist, but may have large methodological flaws. | − |
| 4 | Benchmark developed, but may measure different core skills from the ideal measure | Cyber offense (CyBench) |
| 5 | Benchmark measures roughly what we want; superhuman range; methodology is documented and reproducible but may have validity concerns | Software (HCAST++) |
| 6 | "Production quality" high-validity benchmark. Strongly superhuman range; run on many frontier models; red-teamed for validity; represents all sub-capabilities. Portable implementation. | − |
| 7 | Extensive validity checks against downstream properties; reasonable attempts (e.g. fine-tuning) to detect whether AIs are manipulating/sandbagging their scores | Knowledge (MMLU with fine-tuning) |
| 8 | − | − |
| 9 | Benchmark has high validity to real-world contexts beyond lab settings. Can ensure accuracy even when measuring superintelligences with potentially deceptive capabilities. | − |
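To make the ladder concrete, here is a minimal sketch of how one might encode it, assuming Python and illustrative names of my own choosing (`BRL`, `benchmark_brls`); this is not from any existing library, and the only placements shown are the ones from the table's Example column:

```python
from enum import IntEnum

class BRL(IntEnum):
    """Benchmark Readiness Levels, mirroring the table above."""
    THEORY_DEFINED = 1        # theoretical relevance to x-risk defined
    OPERATIONALIZED = 2       # property operationalized for frontier AIs and ASIs
    OBSERVED = 3              # behavior observed in artificial settings
    BENCHMARK_EXISTS = 4      # a benchmark exists, possibly off-target
    ROUGHLY_VALID = 5         # measures roughly the right thing; superhuman range
    PRODUCTION_QUALITY = 6    # high-validity, red-teamed, portable
    SANDBAGGING_CHECKED = 7   # validity checks plus sandbagging-detection attempts
    LEVEL_8 = 8               # definition left open in the table above
    REAL_WORLD_VALID = 9      # valid beyond lab settings, even for deceptive ASIs

# Example placements taken directly from the table.
benchmark_brls = {
    "Cyber offense (CyBench)": BRL.BENCHMARK_EXISTS,
    "Software (HCAST++)": BRL.ROUGHLY_VALID,
    "Knowledge (MMLU with fine-tuning)": BRL.SANDBAGGING_CHECKED,
}
```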
Here’s a draft list of properties we could apply the BRL to, including both capabilities and propensities (a toy sketch of how these could be tracked follows the list):
**General capabilities**
- Software ability (HCAST, others)
- Knowledge (Humanity’s Last Exam, others)

**Sandbagging**
- Situational awareness
- Alignment faking
- Sandbagging ability

**Monitorability**
- Steganographic collusion
- Neuralese
- Faithful CoT
- Ability to monitor smarter AIs

**Research acceleration**
- % LoC written by AIs
- AI R&D uplift

**Escape risk**
- Adversarial competence
- Self-exfiltration
- Cyber offense

**Misalignment**
- Reward hacking
- Misaligned goals

**Other dangers**
- CBRN
- Persuasion
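As a usage sketch, building on the hypothetical `BRL` enum above: one could keep a draft property-to-BRL map, leave unassessed properties as `None`, and query for the gaps. The three filled-in entries are the ones backed by the table; every other value is deliberately left open rather than guessed:

```python
from typing import Optional

# Draft property -> BRL map. Only entries backed by the table above are
# filled in; the rest are left as None (not yet assessed).
property_brls: dict[str, Optional[BRL]] = {
    "Software ability": BRL.ROUGHLY_VALID,         # HCAST++ (BRL 5)
    "Knowledge": BRL.SANDBAGGING_CHECKED,          # MMLU with fine-tuning (BRL 7)
    "Cyber offense": BRL.BENCHMARK_EXISTS,         # CyBench (BRL 4)
    "Situational awareness": None,
    "Alignment faking": None,
    "Sandbagging ability": None,
    "Steganographic collusion": None,
    "Faithful CoT": None,
    "AI R&D uplift": None,
    "Self-exfiltration": None,
    "Reward hacking": None,
    "Misaligned goals": None,
    "CBRN": None,
    "Persuasion": None,
}

def below(target: BRL) -> list[str]:
    """Properties whose benchmark maturity is unknown or below `target`."""
    return [prop for prop, level in property_brls.items()
            if level is None or level < target]

# Everything not yet at "production quality" (BRL 6+):
print(below(BRL.PRODUCTION_QUALITY))
```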
Is this doomed? Am I missing anything important?