DeepMind’s “Frontier Safety Framework” is weak and unambitious
FSF blogpost. Full document (just 6 pages; you should read it). Compare to Anthropic’s RSP, OpenAI’s RSP (“Preparedness Framework”), and METR’s Key Components of an RSP.
Google DeepMind’s FSF has three steps:
Create model evals for warning signs of “Critical Capability Levels”
Evals should have a “safety buffer” of at least 6x effective compute so that CCLs will not be reached between evals
They list 7 CCLs across “Autonomy, Biosecurity, Cybersecurity, and Machine Learning R&D,” and they’re thinking about CBRN
E.g. “Autonomy level 1: Capable of expanding its effective capacity in the world by autonomously acquiring resources and using them to run and sustain additional copies of itself on hardware it rents”
Do model evals every 6x effective compute and every 3 months of fine-tuning (see the sketch after this list)
This is an “aim,” not a commitment
Nothing about evals during deployment [update[1]]
“When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results. We will also take into account considerations such as additional risks flagged by the review and the deployment context.” The document briefly describes 5 levels of security mitigations and 4 levels of deployment mitigations.
The mitigations aren’t yet connected to eval results or other triggers; there are no advance commitments about safety practices
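To make the eval cadence in step 2 concrete, here is a minimal sketch of the kind of trigger the 6x-compute and 3-month rules describe. This is my own illustration, assuming a simple two-condition check; the names, units, and exact logic are not anything DeepMind specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical illustration of the FSF's stated cadence: run the dangerous-
# capability evals whenever effective compute has grown ~6x since the last
# eval, or ~3 months of fine-tuning have elapsed. All names and the exact
# trigger logic are assumptions made for illustration.

COMPUTE_GROWTH_TRIGGER = 6.0               # "every 6x effective compute"
FINE_TUNING_INTERVAL = timedelta(days=90)  # "every 3 months of fine-tuning"

@dataclass
class EvalState:
    effective_compute_at_last_eval: float  # arbitrary units
    last_eval_date: datetime

def eval_due(state: EvalState, effective_compute_now: float, now: datetime) -> bool:
    """True if either the compute-growth or the time-based trigger fires."""
    growth = effective_compute_now / state.effective_compute_at_last_eval
    return (growth >= COMPUTE_GROWTH_TRIGGER
            or now - state.last_eval_date >= FINE_TUNING_INTERVAL)

# Example: only 4x compute growth, but ~100 days of fine-tuning -> eval is due.
state = EvalState(effective_compute_at_last_eval=1.0,
                  last_eval_date=datetime(2024, 2, 1))
print(eval_due(state, effective_compute_now=4.0, now=datetime(2024, 5, 11)))  # True
```

The “safety buffer” point in step 1 is then just that the warning-sign thresholds should sit at least one such 6x interval below the CCLs themselves, so that a CCL can’t be crossed between evals.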
The FSF doesn’t contain commitments. The blogpost says “The Framework is exploratory and we expect it to evolve significantly” and “We aim to have this initial framework fully implemented by early 2025.” The document says similar things. It uses the word “aim” a lot and the word “commit” never. The FSF basically just explains a little about DeepMind’s plans on dangerous capability evals. Those details do seem reasonable. (This is unsurprising given their good dangerous capability evals paper two months ago, but it’s good to hear about evals in a DeepMind blogpost rather than just a paper by the safety team.)
(Ideally companies would both make hard commitments and talk about what they expect to do, clearly distinguishing between these two kinds of statements. Talking about plans like this is helpful. But with no commitments, DeepMind shouldn’t get much credit.)
(The FSF is not precise enough to be something one could commit to: DeepMind could commit to doing the model evals regularly, but the FSF doesn’t discuss specific mitigations as a function of risk assessment results.[2])
Misc notes (but you should really read the doc yourself):
The document doesn’t specify whether “deployment” includes internal deployment. (This is important because maybe lots of risk comes from the lab using AIs internally to do AI development.) Standard usage suggests internal deployment is excluded, and the document’s focus on misuse (and related cues) also suggests it’s excluded, but the mention of ML R&D as a dangerous capability suggests it’s included. [Update.[3]]
No mention of evals during deployment (to account for improvements in scaffolding, prompting, etc.). [Update.[1]]
The document says “We expect it to evolve substantially as our understanding of the risks and benefits of frontier models improves, and we will publish substantive revisions as appropriate” and a few similar things. The document doesn’t say how it will be revised/amended, which isn’t surprising, since it doesn’t make formal commitments.
No external evals or accountability, but they’re “exploring” it.
Public accountability: unfortunately, there’s no mention of releasing eval results or even announcing when thresholds are reached. They say “We are exploring internal policies around alerting relevant stakeholder bodies when, for example, evaluation thresholds are met.”[4]
But they did publish evals and eval results for their recent releases (1, 2)
Brief mention of alignment and control: “protection against the risk of systems acting adversarially against humans may require additional Framework components, including new evaluations and control mitigations that protect against adversarial AI activity.”
Security mitigations:
Mostly they discuss developers’ access to the weights.
This is disappointing. It’s important but lots of other stuff is important too. Fortunately they mention how their levels correspond to the levels in the forthcoming RAND report on securing model weights, which is much more exhaustive.
Some people get unilateral access to weights until the top level. This is disappointing. It’s been almost a year since Anthropic said it was implementing two-party control, where nobody can unilaterally access the weights.[5] (But reducing the number of people with access to the weights is less important than ensuring that when someone does access the weights they can’t email them away or something.)
They say the status quo is level 0 out of 4. Everyone says Google has great security; I’d assumed they were already around their level 2 or 3. Alas.
Their deployment mitigation levels are about preventing misuse of dangerous capabilities; this is fine, but there are basically no details.
CCLs:
Nothing on how they’ll operationalize (warning signs for) the CCLs.
The categories are good. The CCL definitions aren’t detailed, but they do a good job of capturing the correct dimensions of risk within each category. The thresholds feel high: 6 of the 7 CCLs describe capabilities that would be a Really Big Deal in prosaic terms, and ~4 would be a big deal for x-risk. But you can’t say whether the thresholds are “too high” without corresponding safety mitigations, which this document doesn’t have.
Note that the “level 1” mitigations don’t correspond to the “level 1” CCLs; the levels are totally independent.
Maybe this document was rushed because DeepMind wanted to get something out before the AI Seoul Summit next week. I’ve heard that the safety team has better and more detailed plans. Hopefully some of those get published in DeepMind’s voice (e.g. posted on the DeepMind blog or pronounced by DeepMind leadership) soon. Hopefully the bottleneck is polishing those plans, not weakening them to overcome a veto from DeepMind leadership.
A brief reminder of how other labs are doing on RSPs (I feel very comfortable about these claims, but I omit justification, and they aren’t consensus):
Anthropic RSP: a promising RSP so far but missing the all-important evals and safety practices for ASL-4.
OpenAI beta Preparedness Framework: an unpromising beta RSP. Non-exhaustively: the capability thresholds are way too high, there’s nothing about internal deployment, OpenAI might be required to share models with Microsoft despite these commitments, and it failed to share the risk scorecard when releasing GPT-4o.
Meta has nothing like an RSP. With Llama 3, they have started doing red-teaming and model evals for dangerous capabilities: CBRNE and some cyber capabilities. They have not made safety commitments based on risk assessment results.
Microsoft says “When it comes to frontier model deployment, Microsoft and OpenAI have together defined capability thresholds that act as a trigger to review models in advance of their first release or downstream deployment. The scope of a review, through our joint Microsoft-OpenAI Deployment Safety Board (DSB), includes model capability discovery.” This sounds good, but Microsoft has not elaborated on these capability thresholds, shared details about the DSB, or shared details about past reviews. So those details may be quite bad and reviews may not even be about real dangerous capabilities.
[1] Update: a DeepMind senior staff member says the 3-month condition includes during deployment. Yay.
[2] But it says they plan to: “As we better understand the risks posed by models at different CCLs, and the contexts in which our models will be deployed, we will develop mitigation plans that map the CCLs to the security and deployment levels described.” But maybe only after the thresholds are crossed: “When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan.”
[3] Update: a DeepMind senior staff member says “deployment” means external deployment.
[4] The full sentence doesn’t parse: “We are exploring internal policies around alerting relevant stakeholder bodies when, for example, evaluation thresholds are met, and in some cases mitigation plans as well as post-mitigation outcomes.” What about mitigation plans?
[5] See Frontier Model Security. But Anthropic hasn’t announced that it has successfully implemented this.