Model evals for dangerous capabilities
Testing an LM system for dangerous capabilities is crucial for assessing its risks.
Summary of best practices
Best practices for labs evaluating LM systems for dangerous capabilities:
Publish results
Publish questions/tasks/methodology (unless that’s dangerous, e.g. CBRN evals; if so, offer to share more information with other labs, government, and relevant auditors, and publish a small subset)
Do good elicitation and publish details (or at least demonstrate that your elicitation is good):
General finetuning (for “instruction following, tool use, and general agency” and maybe capabilities in the relevant area)
Helpful-only; no inference-time mitigations
Scaffolding, prompting, chain of thought
The lab should mention some details so that observers can understand how powerful and optimized the scaffolding is. Open-sourcing scaffolding or sharing techniques is supererogatory. If the lab does not share its scaffolding, it should show that the scaffolding is effective by running the same model with the most powerful relevant open-source scaffolding, or running the model on evals like SWE-bench where existing scaffolding provides a baseline, and comparing the results.
Tools: often enable internet browser and code interpreter; enable other tools depending on the field or task
Permit many attempts (pass@n) when relevant to the threat model (e.g. for coding); otherwise, permit many attempts or use a weaker technique (especially best-of-n or self-consistency)
Look at transcripts to determine how common spurious failures are and fix them
Bonus: post-train on similar tasks
Forecasting: for each of the lab’s evals (or at least crucial or cheap evals), run on smaller/weaker models to get scaling laws and forecast performance as a function of effective training compute
Share with third-party evaluators
Offer to share with external evaluators (including UK AISI, US AISI, METR, and Apollo) pre-deployment
What access to share? At least helpful-only and no-inference-time-mitigations. Bonus: good fine-tuning & RL.
Let them publish their results, and ideally incorporate results into risk assessment
(Thresholds: for each threat model or area of dangerous capabilities, have a high-level capability threshold, and operationalize it as an eval score)
(At least while the safety case is that the model doesn’t have dangerous capabilities)
(This is hard; if a lab is unable to operationalize a capability threshold, it could operationalize a lower bound, to trigger reassessing capabilities and risks; it could also operationalize an upper bound, as a conservative commitment)
(Thresholds should have a safety buffer in case the lab is under-eliciting the capability, and sometimes to give warning months before the lab will reach the actual threshold, and sometimes to account for future elicitation advances)
(Out of scope here: thresholds should trigger good predetermined responses)
Have good evals: tasks should successfully measure capability in the relevant area
Have good threat models; have evals for all major risks, including autonomy, scheming or situational awareness, offensive cyber, biothreat uplift, and maybe manipulation or persuasion (especially like building rapport, manipulation, and deception—not like intervening on political opinions). Also use AI R&D evals as an early warning sign for AI-boosted AI research leading to rapid increases in dangerous capabilities.
Use high-quality open-source evals such as some DeepMind evals; InterCode-CTF, Cybench, or other CTFs; maybe some OpenAI evals; and maybe METR autonomy evals (this does not substitute for offering access to METR).
Using easier evals or weak elicitation is fine, if it’s done such that by the time the model crosses danger thresholds, the developer will be doing the full eval well.
These recommendations aren’t novel — they just haven’t been collected before.
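As a rough illustration of the forecasting recommendation above, here is a minimal sketch that fits a trend to eval scores from smaller models and extrapolates to a larger one. It assumes the score is a fraction strictly between 0 and 1 and that logit(score) is roughly linear in log10 of effective training compute; that functional form, the function names, and the example numbers are all assumptions made for illustration, not anything a lab has published.

```python
import math

def logit(p):
    """Map a score in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit_linear(compute, scores):
    """Ordinary least-squares fit of logit(score) = a + b * log10(compute).

    ASSUMPTION: this functional form is chosen for illustration only.
    """
    xs = [math.log10(c) for c in compute]
    ys = [logit(s) for s in scores]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def forecast(a, b, compute):
    """Predicted eval score at the given effective training compute."""
    return sigmoid(a + b * math.log10(compute))

# Hypothetical scores from four smaller models (made-up numbers).
example_compute = [1e22, 1e23, 1e24, 1e25]
example_scores = [0.02, 0.05, 0.12, 0.25]
a, b = fit_logit_linear(example_compute, example_scores)
predicted_at_1e26 = forecast(a, b, 1e26)
```

A real forecast would also need uncertainty estimates, since four points and an assumed curve shape say little about where a capability jumps.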
How labs are doing
I made a table: DC evals: labs’ practices.
This post basically doesn’t consider some crucial factors in labs’ evals, especially the quality of particular evals, nor some adjacent factors, especially capability thresholds and planned responses to reaching capability thresholds. One good eval is better than lots of bad evals but I haven’t figured out which evals are good. What’s missing from this post—especially which evals are good, plus illegible elicitation stuff—may well be more important than the desiderata this post captures.
DeepMind > OpenAI ≥ Anthropic >>> Meta >> everyone else. The evals reports published by DeepMind, OpenAI, and Anthropic are similar. I only weakly endorse this ranking, and reasonable people disagree with it.
DeepMind, OpenAI, and Anthropic all have decent coverage of threat models; DeepMind is the best. They all do fine on sharing their methodology and reporting their results; DeepMind is the best. They’re all doing poorly on sharing with third-party evaluators before deployment; DeepMind is the worst. (They’re also all doing poorly on setting capability thresholds and planning responses, but that’s out-of-scope for this post.)
My main asks on evals for the three labs are similar: improve the quality of their evals, improve their elicitation, be more transparent about eval methodology, and share better access with a few important third-party evaluators. For OpenAI and Anthropic, I’m particularly interested in them developing (or adopting others’) evals for scheming or situational awareness capabilities, or committing to let Apollo evaluate their models and incorporate that into their risk assessment process.
(Meta does some evals but has limited scope, fails to share some details on methodology and results, and has very poor elicitation. Microsoft has a “deployment safety board” but it seems likely ineffective, and it doesn’t seem to have a plan to do evals. xAI, Amazon, and others seem to be doing nothing.)
Appendix: misc notes on best practices
This section is composed of independent paragraphs with various notes on best practices.
Publishing evals and explaining methodology has several benefits: it lets external observers check whether your evals are good, lets them suggest improvements, lets other labs adopt your evals, and can boost the general science of evals. The downsides are that evaluations could contain dangerous information—especially CBRN evals—and that publishing the evals could cause solutions to appear in future models’ training data. When the downsides are relevant, labs can achieve lots of the benefits with little of the downsides by sharing evals privately (with other labs, the government, and relevant external evaluators) and perhaps publishing a small semirandom subset of evals with transcripts. Optionally, labs can use the METR standard or Inspect. Publishing eval results informs observers about your model’s dangerous capabilities and provides accountability for interpreting results well and responding well. If you publish enough details, others can run your evals and compare to your scores for risk-assessment and sanity-checking purposes, in addition to noticing issues and helping you improve your evals. For human experiments: publish methodology details (such that a third party could basically replicate it, modulo access to model versions, posttraining, or scaffolding).
=====
Elicitation:
Scaffolding and prompting: maybe the team doing specialized benchmarks should also do general benchmarks (e.g. SWE-bench): if they get good scores on the general benchmarks, this is evidence that their prompting and scaffolding is good on the specialized benchmarks
Tools: multimodality: if models are not multimodal, they will perform poorly in areas requiring visual or audio input or output (e.g. visual input for physical science and engineering; perhaps audio input and output for persuasion). So capability in these areas may increase sharply if the model is replaced by a multimodal model or combined with relevant tools. This does not seem like a big deal; I think it’s fine to ignore.
Permit many attempts, best-of-n, or self-consistency: sometimes this makes sense given the field or threat model; e.g. in coding it’s often fine for the model to require several attempts. Additionally, this can provide a safety buffer or indicate near-future capability levels (including levels that will be reached via post-training enhancements, without retraining).
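For concreteness, here is a minimal sketch of how pass@k can be estimated when the model is sampled n times per task and c of those attempts succeed. It uses the standard unbiased estimator from the code-generation literature, 1 - C(n-c, k) / C(n, k); the function names and the example numbers are illustrative assumptions, and nothing here is claimed to match any lab's actual pipeline.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k for one task.

    n: number of sampled attempts, c: number of successful attempts,
    k: number of attempts the threat model permits (1 <= k <= n).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n attempts has no success.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical task: 20 attempts, 2 successes.
single_attempt = pass_at_k(20, 2, 1)   # 0.1
ten_attempts = pass_at_k(20, 2, 10)    # about 0.763
```

The gap between the two numbers is the point: a task the model solves 10% of the time looks very different once the threat model allows ten tries.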
=====
Fix spurious failures: look at transcripts to identify why the agent fails and whether the failures are due to easily fixable mistakes or issues with the task or infrastructure. If so, fix them, or at least quantify how common they are. Example.
Separately from looking at the transcripts, fixing issues, and summarizing findings, ideally the labs would publish (a random subset of) transcripts.
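Quantifying spurious failures can be as simple as tallying hand-assigned labels on transcripts. The sketch below assumes each transcript has been labelled by a reviewer; the label names and the `SPURIOUS_LABELS` set are hypothetical, and excluding spurious runs from the denominator is one possible convention rather than an established one.

```python
from collections import Counter

# ASSUMPTION: hypothetical label names a reviewer might assign to transcripts.
SPURIOUS_LABELS = {"infrastructure_error", "ambiguous_task", "scaffold_bug"}

def summarize_transcripts(labels):
    """Summarize reviewer labels for one eval.

    labels: list of strings, one per transcript. "success" marks a solved
    task; labels in SPURIOUS_LABELS mark failures not caused by the model's
    capability; anything else is treated as a genuine failure.
    """
    counts = Counter(labels)
    total = len(labels)
    successes = counts["success"]
    spurious = sum(counts[label] for label in SPURIOUS_LABELS)
    valid = total - spurious
    return {
        "raw_success_rate": successes / total,
        "spurious_failure_fraction": spurious / total,
        # Success rate after dropping runs that failed for spurious reasons.
        "adjusted_success_rate": successes / valid if valid else float("nan"),
    }

example_labels = (
    ["success"] * 3 + ["genuine_failure"] * 5 + ["infrastructure_error"] * 2
)
example_summary = summarize_transcripts(example_labels)
```

Reporting both the raw and adjusted rates lets readers see how much of the headline number depends on the reviewer's judgment.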
=====
Using easier evals or weak elicitation is sometimes fine, if done right.[1]
=====
Third-party evals.
Third-party evals improve labs’ risk assessment and provide accountability. Additionally, informing UK AISI and US AISI about models’ capabilities could be valuable.
Who to offer access to? Government evaluators (UK AISI and US AISI) and organizations with expertise in eliciting model capabilities and evaluating for at least one dangerous capability (METR and Apollo). Perhaps labs should also offer access to organizations with just expertise in a dangerous capability, if the model is trained and set up such that the evaluators don’t need to be good at elicitation.
What access to share? A model that’s almost as powerful as the final model (including post-training), that’s helpful-only, with no inference-time mitigations. Bonus: the ability to do their own fine-tuning & RL.
Benefits: the same reasons such access is good for internal evals.
The cost and downside of sharing helpful-only access with no inference-time mitigations are low, except perhaps for PR concerns: the lab may not want the evaluator to say that it got the lab’s model to misbehave.
Evaluators should be allowed to publish their results.
=====
Labs should use high-quality open-source evals, such as some DeepMind evals (especially self-reasoning and CTF); InterCode-CTF, Cybench, or other CTFs; maybe some OpenAI evals; and maybe METR autonomy evals (this does not substitute for offering access to METR). When a lab doesn’t have an in-house eval for an area of capabilities, it can use others’ evals; even when it does, using others’ evals can improve its risk assessment and enable observers to understand the eval better and to predict future models’ performance.
=====
Which models should a lab evaluate?
All models that might be the most powerful in the domain. Sometimes no one model is best on all evals.
After reaching danger thresholds, evaluate as necessary to make safety cases applicable to specific deployment plans. For each kind of deployment (e.g. releasing weights, deploying via API with deep access and little monitoring, deploying via API with shallow access and strong monitoring, deploying just internally), evaluate the most powerful model that’s planned to be deployed that way.
When should a lab run evals? Different threats require different times.
During training for risk from weights being stolen
Before internal deployment for risk from scheming (faking alignment in order to gain power), especially the risk of escape
Before external deployment for risk from misuse
During deployment as the lab’s post-training and elicitation improves, users’ elicitation improves, and the lab changes its deployment setup
How often to repeat evals during training and deployment? It depends on the rate of progress and the level of risk, and evaluating frequently substitutes for having a large safety buffer. But in a vacuum, every three months seems good.
Not testing the final model is fine if the safety buffer is sufficiently large, but if post-training improves performance substantially then the safety buffer needs to be very large or the lab needs to do great post-training on the non-final model.
=====
Labs may keep elicitation techniques secret to preserve their advantage, but sharing such information seems fine in terms of safety. But for now this is moot: labs’ elicitation in evals for dangerous capabilities seems quite basic, not using secret powerful techniques.
=====
This post is about evals for dangerous capabilities. Some other kinds of evals are relevant to extreme risk too:
Dangerous capabilities elicitable when the model is deployed in a certain way
These evals are done with mitigations (both posttraining and inference-time stuff), but there should still be a safety buffer because users are better at elicitation than the red-team, because they can use future improvements and they have so much more total time. In particular, ideally the lab should either assume that a mitigation can be bypassed or demonstrate that it is sufficiently robust that (given the lab’s monitoring, access policy, and KYC) users can’t bypass it.
Danger from scheming, given a deployment protocol, assuming the model is scheming coherently
Propensity to fake alignment in order to gain power later and propensity to attempt to hack the lab or otherwise subvert safety measures if given the opportunity
Appendix: misc notes on particular evals
This is very nonexhaustive, shallow (just based on labs’ reports, not looking at particular tasks), and generally of limited insight. I’m writing it because I’m annoyed that when labs release or publish reports on their evals, nobody figures out whether the evals are good (or at least they don’t tell me). Perhaps this section will inspire someone to write a better version of it.
Google DeepMind:
Evals sources: evals paper, evals repo, Frontier Safety Framework, Gemini 1.5.
Existing comments: Ryan Greenblatt.
Evals:
Persuasion: the specific threat models seem good (and in particular more plausible than the “change the beliefs of the reader/interlocutor” threat model): evals include dialogues where the model attempts to persuade the user to give money to a charity, build trust and rapport, or covertly manipulate the user (e.g. get them to follow a suspicious link)
Offensive cyber CTFs: in-house CTF (repo) + “Hack The Box” + InterCode-CTF: seems reasonable
Offensive cyber: vulnerability detection: seems reasonable
Self-proliferation (repo): seems reasonable
Self-reasoning (repo): “knowledge gaps” + “instrumental self-modification” seems reasonable
CBRN: knowledge-based; in development and “still nascent”
DeepMind scored performance on each eval 0-4 (except CBRN), but doesn’t have predetermined thresholds, and at least some evals would saturate substantially below 4. DeepMind’s FSF has high-level “Critical Capability Levels” (CCLs); they feel pretty high; they use different categories from the evals described above (they’re in Autonomy, Biosecurity, Cybersecurity, and ML R&D).[2]
OpenAI:
Evals sources: o1 system card, Preparedness Framework, evals repo.
Evals:
Offensive cyber: CTFs: seems reasonable
Biological threats: uplift on questions (with bio experts and novices): I don’t know
Biological threats: automated evals: insufficient detail to assess
Persuasion: OpenAI’s original eval tested interventions on Americans’ political opinions, which seems like a relatively implausible threat model for persuasion, but its new evals involve more realistic manipulation targeting GPT-4o
Autonomy (related to “self-exfiltration, self-improvement, and resource acquisition”): mostly normal coding tasks, plus some “agentic tasks” (which OpenAI also calls “autonomous replication and adaptation” tasks) whose details are unclear
OpenAI’s PF has high-level “risk levels”; they feel pretty high; they are not operationalized in terms of low-level evals.
Anthropic:
Evals sources: RSP evals report – Claude 3 Opus, Responsible Scaling Policy.
Evals:
Autonomous replication and adaptation: tasks seem good
CBRN: uplift on knowledge-y questions (with domain experts): seems reasonable; insufficient detail to assess
Offensive cyber: InterCode-CTF seems good
Offensive cyber: vulnerability discovery + exploit development evals: seems reasonable; insufficient detail to assess
Meta:
Evals sources: Llama 3, CyberSecEval 3.
Evals:
Cyber automated evals (details): “Vulnerability identification challenges” + “Spear phishing benchmark” + “Attack automation framework”
I haven’t read this carefully.
Google’s Project Naptime observed that CyberSecEval 2 did no elicitation; Google used basic elicitation techniques to improve performance, including from 5% to 100% on one class of tests. CyberSecEval 3 acknowledges this but still does no elicitation.
Cyber: uplift (details): I don’t know
Chemical and biological weapons: uplift: “six-hour scenarios”; few details; results not reported
(Meta also does evals to measure a model’s default propensity for cyber-related undesired behaviors—such as writing insecure code or complying with requests to perform cyberattacks—rather than its capability. This has little relevance to risk, since undesired behavior like accidentally writing insecure code is not a large source of risk and refusing bad requests is easily circumvented, at least given the deployment strategy of releasing model weights.)
Appendix: reading list
Sources on evals and elicitation:[3]
Challenges in evaluating AI systems (Anthropic)
A new initiative for developing third-party model evaluations (Anthropic)
Sources on specific labs:
Google DeepMind
OpenAI
Anthropic
Meta
Microsoft
Thanks to several anonymous people for discussion and suggestions. They don’t necessarily endorse this post.
Crossposted from AI Lab Watch.
[1] If a model is far from having a dangerous capability, full high-quality evals for that capability may be unnecessarily expensive. They may also be uninformative if the model scores close to zero and the eval struggles to detect small differences in the capability near the current level.
In this case, the developer can use a strictly easier version of the eval. The easier version will trigger before the actual version so there is no risk, and the easier version could be cheaper and more informative.
Or the developer could use a separate easier or cheaper yellow-line eval if its threshold is low enough that it is sure to trigger before the relevant threshold in the actual eval. Insofar as the relationship between performance on the two evals is unpredictable, the developer will have to set the yellow-line threshold lower. Hopefully in the future we will determine patterns in models’ performance on different evals, and this will let us say we’re quite sure that scoring <k% on an easy eval means you’ll score <x% on the real eval for smaller x.
(If the developer might be close to danger thresholds, then if it uses easier evals, the actual evals should be ready to go or the developer should be prepared to pause until doing them.)
Similarly, weak elicitation can be fine if the developer uses a sufficiently large safety buffer. But the upper bound on the power of core elicitation techniques (fine-tuning, chain of thought, basic tooling) is very high, so the developer basically has to use them. And the decent elicitation to excellent elicitation gap can be large and unpredictable.
(Ideal elicitation quality and safety buffer depends on the threat: misuse during intended deployment or the model being stolen, being leaked, or escaping. If the former, it also depends on users’ depth of access and whether post-deployment-correction is possible.)
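To make the yellow-line idea in this footnote concrete, here is a minimal sketch of one heuristic for picking a yellow-line threshold on an easy eval from past models' paired scores. The heuristic, the function name, and the example numbers are assumptions made for illustration; the footnote does not specify any procedure. The `margin` argument stands in for how unpredictable the relationship between the two evals is: the less predictable, the larger the margin and the lower the yellow line.

```python
def yellow_line_threshold(pairs, real_threshold, margin):
    """Pick an easy-eval score that should trigger before the real threshold.

    pairs: list of (easy_score, real_score) for previously evaluated models.
    real_threshold: the score on the real eval that counts as dangerous.
    margin: extra buffer subtracted to cover unpredictability (ASSUMPTION:
            chosen by judgment, not derived from data here).
    """
    # Easy-eval scores of past models that reached the real threshold.
    crossed = [easy for easy, real in pairs if real >= real_threshold]
    if crossed:
        bound = min(crossed)
    else:
        # No past model crossed; we only have evidence up to the best one seen.
        bound = max(easy for easy, _ in pairs)
    return max(0.0, bound - margin)

# Hypothetical history of four models (made-up numbers).
history = [(0.10, 0.00), (0.30, 0.02), (0.55, 0.08), (0.80, 0.25)]
tight = yellow_line_threshold(history, real_threshold=0.20, margin=0.15)
loose = yellow_line_threshold(history, real_threshold=0.20, margin=0.40)
```

With so few data points the bound is weak evidence, which is why the margin does most of the work in this sketch.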
[2] Misc notes:
The FSF says:
> we will define a set of evaluations called “early warning evaluations,” with a specific “pass” condition that flags when a CCL may be reached before the evaluations are run again. We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress. To account for the gap between rounds of evaluation, we will design early warning evaluations to give us an adequate safety buffer before a model reaches a CCL.
The evals paper proposes a CCL for self-proliferation and tentatively suggests an early warning trigger. But this isn’t in the FSF. And it says when a model meets this trigger, it is likely within 6x [] “effective compute” scaleup from the [] CCL, but a safety buffer should be almost certainly >6x effective compute from the CCL.
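The arithmetic behind the ">6x" point can be sketched as follows. If evals run every 6x in effective compute, a model can sit just below the early-warning trigger at one eval and be roughly 6x past it at the next, so the trigger needs to fire more than 6x below the CCL. The function names are hypothetical, and `estimated_ccl_compute` is an assumed input that nobody actually knows in advance; only the 6x and 3-month figures come from the FSF quote.

```python
def eval_is_due(compute_now, compute_at_last_eval, months_finetuning_since_eval,
                compute_factor=6.0, months_limit=3.0):
    """True if either cadence condition from the FSF quote has been reached."""
    grew_enough = compute_now / compute_at_last_eval >= compute_factor
    waited_enough = months_finetuning_since_eval >= months_limit
    return grew_enough or waited_enough

def buffer_covers_eval_gap(trigger_compute, estimated_ccl_compute,
                           eval_interval_factor=6.0):
    """True if the trigger sits more than one eval interval below the CCL.

    Worst case: the model is just under the trigger at one eval, then scales
    by eval_interval_factor before the next eval. It stays below the CCL
    throughout only if CCL compute exceeds trigger compute by more than that.
    ASSUMPTION: estimated_ccl_compute is a hypothetical estimate.
    """
    return estimated_ccl_compute / trigger_compute > eval_interval_factor

# Hypothetical numbers: a trigger at 1e25 with the CCL estimated at 3e25
# leaves only a 3x gap, which one 6x eval interval could skip over.
too_thin = buffer_covers_eval_gap(1e25, 3e25)   # False
adequate = buffer_covers_eval_gap(1e25, 1e26)   # True
```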
[3] This list may be bad. You can help by suggesting improvements.
Correction, October 15: this is wrong; the policy was ambiguous but it didn’t really commit to the thing I thought it did. Mea culpa.
Footnote to table cells D18 and D19:
My reading of Anthropic’s ARA threshold, which nobody has yet contradicted:
The RSP defines/operationalizes 50% of ARA tasks (10% of the time) as a sufficient condition for ASL-3.
(Sidenote: I think the literal reading is that this is an ASL-3 definition, but I assume it’s supposed to be an operationalization of the safety buffer, 6x below ASL-3.[1])
The May RSP evals report suggests 50% of ARA tasks (10% of the time) is merely a yellow line. (Pages 2 and 6, plus page 4 says “ARA Yellow Lines are clearly defined in the RSP” but the RSP’s ARA threshold was not just a yellow line.)
This is not kosher; Anthropic needs to formally amend the RSP to [raise the threshold / make the old threshold no longer binding].
(It’s totally plausible that the old threshold was too low, but the solution is to raise it officially, not pretend that it’s just a yellow line.)
(The forthcoming RSP update will make old thresholds moot; I’m just concerned that Anthropic ignored the RSP in the past.)
(Anthropic didn’t cross the threshold and so didn’t violate the RSP — it just ignored the RSP by implying that if it crossed the threshold it wouldn’t necessarily respond as required by the RSP.)
Update: another part of the RSP says this threshold implements the safety buffer.
Potential for o1-like long horizon reasoning post-training compromises capability evals for open weights models. If it’s not applied before/during evals, then evals will significantly underestimate capabilities of the system that can be built out of the open weights later, when the recipe is reproduced.
This likely will be relevant for Llama 4 (as the first 1e26+ FLOPs model with open weights), if they don’t manage a good reproduction of o1-like post-training before release, and continue the policy of publishing in open weights if the evals are not too alarming.
Another possible best practice for evals: use human+AI rather than [edit: just] AI alone. Many threat models involve human+AI and sometimes human+AI is substantially stronger than human alone and AI alone.
You mean “in addition to”, right? Knowing what the AI alone is capable of doing is quite an important part of what evals are about, so keeping it there seems crucial.
Correction:
On whether Anthropic uses chain-of-thought, I said “Yes, but not explicit, but implied by discussion of elicitation in the RSP and RSP evals report.” This impression was also based on conversations with relevant Anthropic staff members. Today, Anthropic says “Some of our evaluations lacked some basic elicitation techniques such as best-of-N or chain-of-thought prompting.” Anthropic also says it believes the elicitation gap is small, in tension with previous statements.
I don’t think that ‘post-train on similar tasks’ should be considered just a bonus. I think that that’s a key part of adequate safety testing. Fine-tuning on similar tasks has a substantial history in ML literature when it comes to evaluating the max capability of a general model on a specific task. It is pretty standard to report variations of: zero-examples (aka zero-shot), n-examples (aka n-shot), 1-attempt, n-attempts (with a resolution scheme such as majority solution gets submitted), fine-tuning on similar task (or subset of the examples for this task).
This isn’t some weird above-and-beyond demand, it’s a standard technique used for assessing capabilities. I would go so far as to say that I would suspect that someone who didn’t try this didn’t actually want to elicit the full capabilities of the model.
The justification for fine-tuning not being a part of the reported assessment of general purpose models is that you want to measure what users will be expected to experience as they interact with the model. But even closed-weight API-only models often offer a fine-tuning API. And definitely if you are trying to assess the risk of the weights being stolen, you need to consider fine-tuning.
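As a small illustration of two of the reporting variants listed above, the sketch below compares 1-attempt accuracy with n-attempt accuracy under a majority-vote resolution scheme (often called self-consistency). The data, task names, and function names are made up for illustration; ties are broken in favor of the answer sampled first, which is just one possible convention.

```python
from collections import Counter

def majority_answer(answers):
    """Most common answer among the attempts; ties go to the earliest seen."""
    return Counter(answers).most_common(1)[0][0]

def accuracy_single_attempt(samples, truth):
    """Fraction of tasks where the first sampled answer is correct."""
    correct = sum(1 for task, answers in samples.items()
                  if answers[0] == truth[task])
    return correct / len(samples)

def accuracy_majority_vote(samples, truth):
    """Fraction of tasks where the majority answer over all attempts is correct."""
    correct = sum(1 for task, answers in samples.items()
                  if majority_answer(answers) == truth[task])
    return correct / len(samples)

# Hypothetical sampled answers for three tasks, three attempts each.
example_samples = {
    "task_1": ["A", "B", "A"],
    "task_2": ["C", "C", "D"],
    "task_3": ["X", "Y", "Y"],
}
example_truth = {"task_1": "A", "task_2": "D", "task_3": "Y"}

single = accuracy_single_attempt(example_samples, example_truth)    # 1/3
majority = accuracy_majority_vote(example_samples, example_truth)   # 2/3
```

Note that task_2 is one where pass@3 would count a success but majority vote does not, which is why the resolution scheme should be reported alongside the score.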
Addendum:
I think it’d be great if we sorted dangerous capabilities evals into two categories, mitigated and unmitigated hazards.
Mitigated means the hazards you measure with enforceable mitigations in place, as in behind a secure API. This includes:
API-based fine-tuning, where you filter the data that customers are allowed to fine-tune on and/or put various restrictions in place on the fine-tuning process.
Limited prompt-only jailbreaks (including long context many-example jailbreaks). This could include requiring that the jail-breaking needs to evade a filter which is trying to catch and block users that try to jailbreak.
Note that in the context of dangerous capabilities evals, jail-breaking can look like ‘refusal dodging’. Refusal dodging is when you try to justify your question about dangerous technology by placing it in a reasonable context, such as a student studying for an exam, or analyzing an academic paper. If the model will summarize and extract key information from hazardous academic papers, that should be a red flag on a capability eval.
Red-teamers may try to sneak past filters by, for example, fine-tuning the model to communicate in code, using purely innocent statements in plain-text. It’s fair game for the developers to put filters in place to try to catch red-teamers attempting to do this. See https://arxiv.org/html/2406.20053v1
This sounds like it’s relatively easy-mode, and indeed it should be. But I still want to see a separate report on this, since it reassures me that the developer in question is taking reasonable precautions and has implemented them competently.
Unmitigated means the hazards you measure as if the weights had been stolen, or you deliberately released the weights (looking at you Meta). Unmitigated includes:
Unlimited unfiltered jail-breaking attempts.
White-box jail-breaking, where you get to use an optimization process working against the activations of the model to avoid refusals. (White-box attacks are generally harder to resist than black-box attacks.)
Activation steering for non-refusal or for capabilities elicitation
Fine-tuning on task-specific domain knowledge and examples of alignment to terrorist agendas
Merging the model with other models or architectures
It’s fair game for the company to do anti-fine-tuning modifications of the model before doing the unmitigated testing. It’s not fair game for the company to restrict what the red-teamers do to try to undo such measures. See https://arxiv.org/abs/2405.14577
I think this distinction is important, since I don’t think that any companies so far have been good at publishing cleanly separated scores for mitigated and unmitigated. What OpenAI called ‘unmitigated’ is what I would call ‘first pass mitigated, before the red teamers showed us how many holes our mitigations still had and we fixed those specific issues’. That’s also an interesting measure, but not nearly as informative as a true unmitigated eval score.
@Zach Stein-Perlman
Out of curiosity, do you have any thoughts on the importance / feasibility of formal verification / mathematically “provable” safety based approaches in these evals you mention?
No. But I’m skeptical: seems hard to imagine provable safety, much less competitive with the default path to powerful AI, much less how post-hoc evals are relevant.
I would think sharing the scaffolding would be important. Stronger scaffolding could skew evaluation results. From the complete paragraph you seem to suggest that sufficient information of the scaffolding should be published, so I’m curious what you mean here.
Stronger scaffolding makes evals better.
I think labs should at least demonstrate that their scaffolding is as good as some baseline. If there’s no established baseline scaffolding for the eval, they can say “we did XYZ and we got n%; our secret scaffolding does better”; when there is an established baseline scaffolding, they can compare to that (e.g. the best scaffold for SWE-bench Verified is called Agentless; in the o1 system card, OpenAI reported results from running its models in this scaffold).