Safety-relevant properties should be ranked on a “Benchmark Readiness Level” (BRL) scale, inspired by NASA’s Technology Readiness Levels. At BRL 4, a benchmark exists; at BRL 6 the benchmark is highly valid; past this point the benchmark becomes increasingly robust against sandbagging. The definitions could look something like this:
| BRL | Definition | Example |
|---|---|---|
| 1 | Theoretical relevance to x-risk defined | Adversarial competence |
| 2 | Property operationalized for frontier AIs and ASIs | AI R&D speedup; misaligned goals |
| 3 | Behavior (or all parts) observed in artificial settings. Preliminary measurements exist, but may have large methodological flaws. | − |
| 4 | Benchmark developed, but may measure different core skills from the ideal measure | Cyber offense (CyBench) |
| 5 | Benchmark measures roughly what we want; superhuman range; methodology is documented and reproducible but may have validity concerns | Software (HCAST++) |
| 6 | "Production quality" high-validity benchmark. Strongly superhuman range; run on many frontier models; red-teamed for validity; represents all sub-capabilities. Portable implementation. | − |
| 7 | Extensive validity checks against downstream properties; reasonable attempts (e.g. fine-tuning) to detect whether AIs are manipulating/sandbagging their scores | Knowledge (MMLU with fine-tuning) |
| 8 | − | − |
| 9 | Benchmark has high validity to real-world contexts beyond lab settings. Can ensure accuracy even when measuring superintelligences with potentially deceptive capabilities. | − |
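To make the ladder concrete, here is a minimal sketch of how one might encode it, assuming Python and illustrative names of my own choosing (`BRL`, `benchmark_brls`); this is not from any existing library, and the only placements shown are the ones from the table's Example column:

```python
from enum import IntEnum

class BRL(IntEnum):
    """Benchmark Readiness Levels, mirroring the table above."""
    THEORY_DEFINED = 1        # theoretical relevance to x-risk defined
    OPERATIONALIZED = 2       # property operationalized for frontier AIs and ASIs
    OBSERVED = 3              # behavior observed in artificial settings
    BENCHMARK_EXISTS = 4      # a benchmark exists, possibly off-target
    ROUGHLY_VALID = 5         # measures roughly the right thing; superhuman range
    PRODUCTION_QUALITY = 6    # high-validity, red-teamed, portable
    SANDBAGGING_CHECKED = 7   # validity checks plus sandbagging-detection attempts
    LEVEL_8 = 8               # definition left open in the table above
    REAL_WORLD_VALID = 9      # valid beyond lab settings, even for deceptive ASIs

# Example placements taken directly from the table.
benchmark_brls = {
    "Cyber offense (CyBench)": BRL.BENCHMARK_EXISTS,
    "Software (HCAST++)": BRL.ROUGHLY_VALID,
    "Knowledge (MMLU with fine-tuning)": BRL.SANDBAGGING_CHECKED,
}
```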
Here’s a draft list of properties we could apply the BRL to, including both capabilities and propensities (a toy sketch of how these could be tracked follows the list):
**General capabilities**
- Software ability (HCAST, others)
- Knowledge (Humanity’s Last Exam, others)

**Sandbagging**
- Situational awareness
- Alignment faking
- Sandbagging ability

**Monitorability**
- Steganographic collusion
- Neuralese
- Faithful CoT
- Ability to monitor smarter AIs

**Research acceleration**
- % LoC written by AIs
- AI R&D uplift

**Escape risk**
- Adversarial competence
- Self-exfiltration
- Cyber offense

**Misalignment**
- Reward hacking
- Misaligned goals

**Other dangers**
- CBRN
- Persuasion
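As a usage sketch, building on the hypothetical `BRL` enum above: one could keep a draft property-to-BRL map, leave unassessed properties as `None`, and query for the gaps. The three filled-in entries are the ones backed by the table; every other value is deliberately left open rather than guessed:

```python
from typing import Optional

# Draft property -> BRL map. Only entries backed by the table above are
# filled in; the rest are left as None (not yet assessed).
property_brls: dict[str, Optional[BRL]] = {
    "Software ability": BRL.ROUGHLY_VALID,         # HCAST++ (BRL 5)
    "Knowledge": BRL.SANDBAGGING_CHECKED,          # MMLU with fine-tuning (BRL 7)
    "Cyber offense": BRL.BENCHMARK_EXISTS,         # CyBench (BRL 4)
    "Situational awareness": None,
    "Alignment faking": None,
    "Sandbagging ability": None,
    "Steganographic collusion": None,
    "Faithful CoT": None,
    "AI R&D uplift": None,
    "Self-exfiltration": None,
    "Reward hacking": None,
    "Misaligned goals": None,
    "CBRN": None,
    "Persuasion": None,
}

def below(target: BRL) -> list[str]:
    """Properties whose benchmark maturity is unknown or below `target`."""
    return [prop for prop, level in property_brls.items()
            if level is None or level < target]

# Everything not yet at "production quality" (BRL 6+):
print(below(BRL.PRODUCTION_QUALITY))
```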
Is this doomed? Am I missing anything important?