AI strategy & governance. ailabwatch.org. ailabwatch.substack.com.
Zach Stein-Perlman
This shortform discusses the current state of responsible scaling policies (RSPs). They’re mostly toothless, unfortunately.
The Paris summit was this week. Many companies had committed to make something like an RSP by the summit. Half of them did, including Microsoft, Meta, xAI, and Amazon. (NVIDIA did not—shame on them—but I hear they are writing something.) Unfortunately but unsurprisingly, these policies are all vague and weak.
RSPs essentially have four components: capability thresholds beyond which a model might be dangerous by default, an evaluation protocol to determine when models reach those thresholds, a plan for how to respond when various thresholds are reached, and accountability measures.
A maximally lazy RSP (a document intended to look like an RSP without making the company do anything differently) would have vague or extremely high capability thresholds, unspecified or low-quality evaluation, a response like “we will make it safe” rather than substantive mitigations or robustness guarantees, and no accountability measures. Such a policy would be little better than the company saying “we promise to deploy AIs safely.” The new RSPs are basically like that.[1]
Some aspects of some RSPs that existed before the summit are slightly better.[2]
If existing RSPs are weak, how would a strong RSP be different?
Evals: evals should measure relevant capabilities (including cyber, bio, and scheming), evals should be sufficiently difficult, and labs should do good elicitation. (As a lower bar, the evals should exist; many companies say they will do evals but don’t seem to have a plan for what evals to do.)
See generally Model evals for dangerous capabilities and OpenAI’s CBRN tests seem unclear
Response: misuse
Rather than just saying that you’ll implement mitigations such that users can’t access dangerous capabilities, say how you’ll tell if your mitigations are good enough. For example, say that you’ll have a skilled red-team attempt to elicit dangerous stuff from the post-mitigation model, and ensure that either it doesn’t provide uplift or the elicitation required is so involved that doing harm this way is no cheaper than without the AI.
Response: control
AI monitoring for catastrophic actions where the monitor can replace suspicious actions and send particularly suspicious actions to human auditors
And ideally a control evaluation to evaluate your protocol’s adequacy if the model is scheming[3]
Response: security (especially of model weights) — we’re very far from securing model weights against determined sophisticated attackers, so:
Avoid pushing out the Pareto frontier of models that are both powerful/dangerous and insecure by much
Say no AI company has secured their model weights, and this imposes unacceptable risk. Commit that if all other developers were willing to implement super strong security, even though it’s costly, you would too.
Thresholds
Should be low enough that your responses trigger before your models enable catastrophic harm
Ideally should be operationalized in evals, but this is genuinely hard
Accountability
Publish info on evals/elicitation (such that others can tell whether it’s adequate)
Be transparent to an external auditor; have them review your evals and RSP and publicly comment on (1) adequacy and (2) whether your decisions about not publishing various details are reasonable
See also https://metr.org/rsp-key-components/#accountability
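The control response above (a monitor that can replace suspicious actions and escalate the most suspicious ones to human auditors) can be sketched roughly as below. This is a minimal illustration, not any lab's protocol: the thresholds, `route_action`, `toy_monitor`, and `toy_fallback` are all hypothetical names and values made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical thresholds on a suspicion score in [0, 1]; not from any real policy.
REPLACE_THRESHOLD = 0.5
AUDIT_THRESHOLD = 0.9


@dataclass
class Decision:
    executed_action: str  # what actually runs
    replaced: bool        # True if the untrusted model's action was swapped out
    audited: bool         # True if the original action was sent to human auditors


def route_action(
    action: str,
    suspicion_score: Callable[[str], float],
    safe_fallback: Callable[[str], str],
    audit_queue: List[str],
) -> Decision:
    """Route one proposed action through the monitor."""
    score = suspicion_score(action)
    if score >= AUDIT_THRESHOLD:
        # Particularly suspicious: replace it and queue the original for humans.
        audit_queue.append(action)
        return Decision(safe_fallback(action), replaced=True, audited=True)
    if score >= REPLACE_THRESHOLD:
        # Suspicious: replace it, but don't spend auditor time on it.
        return Decision(safe_fallback(action), replaced=True, audited=False)
    return Decision(action, replaced=False, audited=False)


# Toy stand-ins for a real monitor and a trusted fallback policy (assumptions).
def toy_monitor(action: str) -> float:
    if "exfiltrate" in action:
        return 0.95
    if "sudo" in action:
        return 0.6
    return 0.1


def toy_fallback(action: str) -> str:
    return "noop"


queue: List[str] = []
benign = route_action("ls -la", toy_monitor, toy_fallback, queue)
medium = route_action("sudo rm logs", toy_monitor, toy_fallback, queue)
severe = route_action("exfiltrate weights", toy_monitor, toy_fallback, queue)
```

A control evaluation would then ask whether a protocol like this still catches bad actions when the model is actively trying to get them past the monitor; the sketch says nothing about that.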
(What am I happy about in current RSPs? Briefly and without justification: yay Anthropic and maybe DeepMind on misuse stuff; somewhat yay Anthropic and OpenAI on their eval-commitments; somewhat yay DeepMind and maybe OpenAI on their actual evals; somewhat yay DeepMind on scheming/monitoring/control; maybe somewhat yay DeepMind and Anthropic on security (but not adequate).)
(Companies’ other commitments aren’t meaningful either.)
[1]
Microsoft is supposed to conduct “robust evaluation of whether a model possesses tracked capabilities at high or critical levels, including through adversarial testing and systematic measurement using state-of-the-art methods,” but no details on what evals they’ll use. The response to dangerous capabilities is “Further review and mitigations required.” The “Security measures” are underspecified but do take the situation seriously. The “Safety mitigations” are less substantive, unfortunately. There’s not really accountability. Nothing on alignment or internal deployment.
Meta has high vague risk thresholds, a vague evaluation plan, vague responses (e.g. “security protections to prevent hacking or exfiltration” and “mitigations to reduce risk to moderate level”), and no accountability. But they do suggest that if they make a model with dangerous capabilities, they won’t release the weights and if they do deploy it externally (via API) they’ll have decent robustness to jailbreaks — there are loopholes but they hadn’t articulated that principle before. Nothing on alignment or internal deployment.
xAI’s policy begins “This is the first draft iteration of xAI’s risk management framework that we expect to apply to future models not currently in development.” Misuse evals are cheap (like multiple-choice questions rather than uplift experiments); alignment evals reference unpublished papers and mention “Utility Functions” and “Corrigibility Score.” Thresholds would be operationalized as eval results, which is nice, but they don’t yet exist. For misuse, it includes “Examples of safeguards or mitigations” (but not “we’ll know mitigations are adequate if a red team fails to break them” or other details to suggest mitigations will be effective); there are no mitigations for alignment.
Amazon: “Critical Capability Thresholds” are high and vague; evaluation is vague; mitigations are vague (“Upon determining that an Amazon model has reached a Critical Capability Threshold, we will implement a set of Safety Measures and Security Measures to prevent elicitation of the critical capability identified and to protect against inappropriate access risks. Safety Measures are designed to prevent the elicitation of the observed Critical Capabilities following deployment of the model. Security Measures are designed to prevent unauthorized access to model weights or guardrails implemented as part of the Safety Measures, which could enable a malicious actor to remove or bypass existing guardrails to exceed Critical Capability Thresholds.”); there’s not really accountability. Nothing on alignment or internal deployment. I appreciate the list of current security practices at the end of the document.
[2]
Briefly and without justification:
OpenAI is similarly vague on thresholds and evals and responses, with no accountability. (Also OpenAI is untrustworthy in general and has a bad history on Preparedness in particular.) But they’ve done almost-decent evals in the past.
The DeepMind thing isn’t really a commitment, and it doesn’t say much about when to do evals, and the security levels are low (but this is no worse than everyone else being vague), and it doesn’t have accountability, but it does mention deceptive alignment + control evals + monitoring, and DeepMind has done almost-decent evals in the past.
The Anthropic thing is fine except it only goes up to ASL-3 and there’s no control and little accountability (and the security isn’t great (especially for systems after the first systems requiring ASL-3 security) but it’s no worse than others).
See The current state of RSPs modulo the DeepMind FSF update.
[3]
DeepMind says something good on this (but it’s not perfect and is only effective if they do good evals sufficiently frequently). Other RSPs don’t seriously talk about risks from deceptive alignment.
There also used to be a page for Preparedness: https://web.archive.org/web/20240603125126/https://openai.com/preparedness/. Now it redirects to the safety page above.
(Same for Superalignment but that’s less interesting: https://web.archive.org/web/20240602012439/https://openai.com/superalignment/.)
DeepMind updated its Frontier Safety Framework (blogpost, framework, original framework). It associates “recommended security levels” to capability levels, but the security levels are low. It mentions deceptive alignment and control (both control evals as a safety case and monitoring as a mitigation); that’s nice. The overall structure is like “we’ll do evals and make a safety case,” with some capabilities mapped to recommended security levels in advance. It’s not very commitment-y:
We intend to evaluate our most powerful frontier models regularly
When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed. This will inform the formulation and application of a response plan.
These recommended security levels reflect our current thinking and may be adjusted if our empirical understanding of the risks changes.
If we assess that a model has reached a CCL that poses an unmitigated and material risk to overall public safety, we aim to share information with appropriate government authorities where it will facilitate the development of safe AI.
Possibly the “Deployment Mitigations” section is more commitment-y.
I expect many more such policies will come out in the next week; I’ll probably write a post about them all at the end rather than writing about them one by one, unless xAI or OpenAI says something particularly notable.
Meta: Frontier AI Framework
My guess is it’s referring to Anthropic’s position on SB 1047, or Dario’s and Jack Clark’s statements that it’s too early for strong regulation, or how Anthropic’s policy recommendations often exclude RSP-y stuff (and when they do suggest requiring RSPs, they would leave the details up to the company).
o3-mini is out (blogpost, tweet). Performance isn’t super noteworthy (on first glance), in part since we already knew about o3 performance.
Non-fact-checked quick takes on the system card:
the model referred to below as the o3-mini post-mitigation model was the final model checkpoint as of Jan 31, 2025 (unless otherwise specified)
Big if true (and if Preparedness had time to do elicitation and fix spurious failures)
If this is robust to jailbreaks, great, but presumably it’s not, so low post-mitigation performance is far from sufficient for safety-from-misuse; post-mitigation performance isn’t what we care about (we care about approximately what good jailbreakers get on the post-mitigation model).
No mention of METR, Apollo, or even US AISI? (Maybe too early to pay much attention to this, e.g. maybe there’ll be a full-o3 system card soon.) [Edit: also maybe it’s just not much more powerful than o1.]
32% is a lot
The dataset is 1⁄4 T1 (easier), 1⁄2 T2, 1⁄4 T3 (harder); 28% on T3 means that there’s not much difference between T1 and T3 to o3-mini (at least for the easiest-for-LMs quarter of T3)
Probably o3-mini is successfully using heuristics to get the right answer and could solve very few T3 problems in a deep or humanlike way
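A quick check of the arithmetic behind "not much difference between T1 and T3." This assumes the 32% above is the overall score on the same dataset (the quick takes don't state that explicitly), and uses the 1⁄4, 1⁄2, 1⁄4 tier mix to back out the implied combined accuracy on T1 and T2.

```python
from fractions import Fraction

# Dataset mix stated above: 1/4 T1, 1/2 T2, 1/4 T3.
weights = {"T1": Fraction(1, 4), "T2": Fraction(1, 2), "T3": Fraction(1, 4)}

overall = Fraction(32, 100)  # ASSUMPTION: the 32% is the overall score on this dataset
t3 = Fraction(28, 100)       # T3 score stated above

# overall = w_T1*a_T1 + w_T2*a_T2 + w_T3*a_T3, so the combined accuracy on
# T1 and T2 is whatever is left over, divided by their combined weight.
t1_t2_weight = weights["T1"] + weights["T2"]
t1_t2_accuracy = (overall - weights["T3"] * t3) / t1_t2_weight

gap = t1_t2_accuracy - t3
print(f"T1+T2 accuracy: {float(t1_t2_accuracy):.1%}, gap vs T3: {float(gap):.1%}")
```

Under that assumption the easier three quarters of the dataset come out at about 33%, only around 5 points above T3.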
Dario Amodei: On DeepSeek and Export Controls
Thanks. The tax treatment is terrible. And I would like more clarity on how transformative AI would affect S&P 500 prices (per this comment). But this seems decent (alongside AI-related calls) because 6 years is so long.
I wrote this for someone but maybe it’s helpful for others
What labs should do:
I think the most important things for a relatively responsible company are control and security. (For irresponsible companies, I roughly want them to make a great RSP and thus become a responsible company.)
Reading recommendations for people like you (not a control expert but has context to mostly understand the Greenblatt plan):
Control: Redwood blogposts[1] or ask a Redwood human “what’s the threat model” and “what are the most promising control techniques”
Security: not worth trying to understand but there’s A Playbook for Securing AI Model Weights + Securing AI Model Weights
A few more things: What AI companies should do: Some rough ideas
Lots more things + overall plan: A Plan for Technical AI Safety with Current Science (Greenblatt 2023)
More links: Lab governance reading list
What labs are doing:
Evals: it’s complicated; OpenAI, DeepMind, and Anthropic seem close to doing good model evals for dangerous capabilities; see DC evals: labs’ practices plus the links in the top two rows (associated blogpost + model cards)
RSPs: all existing RSPs are super weak and you shouldn’t expect them to matter; maybe see The current state of RSPs
Control: nothing is happening at the labs, except a little research at Anthropic and DeepMind
Security: nobody is prepared; nobody is trying to be prepared
Internal governance: you should basically model all of the companies as doing whatever leadership wants. In particular: (1) the OpenAI nonprofit is probably controlled by Sam Altman and will probably lose control soon and (2) possibly the Anthropic LTBT will matter but it doesn’t seem to be working well.
Publishing safety research: DeepMind and Anthropic publish some good stuff but surprisingly little given how many safety researchers they employ; see List of AI safety papers from companies, 2023–2024
Resources:
I think ideally we’d have several versions of a model. The default version would be ignorant about AI risk, AI safety and evaluation techniques, and maybe modern LLMs (in addition to misuse-y dangerous capabilities). When you need a model that’s knowledgeable about that stuff, you use the knowledgeable version.
Somewhat related: https://www.alignmentforum.org/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais
[Perfunctory review to get this post to the final phase]
Solid post. Still good. I think a responsible developer shouldn’t unilaterally pause but I think it should talk about the crazy situation it’s in, costs and benefits of various actions, what it would do in different worlds, and its views on risks. (And none of the labs have done this; in particular Core Views is not this.)
List of AI safety papers from companies, 2023–2024
One more consideration against (or an important part of “Bureaucracy”): sometimes your lab doesn’t let you publish your research.
Yep, the final phase-in date was in November 2024.
Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman’s checklist, or Holden Karnofsky’s list in his 2022 nearcast), but I find them insufficient for the magnitude of the stakes (to be clear, I don’t think these example lists were intended to be an extensive plan).
See also A Plan for Technical AI Safety with Current Science (Greenblatt 2023) for a detailed (but rough, out-of-date, and very high-context) plan.
Yeah. I agree/concede that you can explain why you can’t convince people that their own work is useless. But if you’re positing that the flinchers flinch away from valid arguments about each category of useless work, that seems surprising.
I feel like John’s view entails that he would be able to convince my friends that various-research-agendas-my-friends-like are doomed. (And I’m pretty sure that’s false.) I assume John doesn’t believe that, and I wonder why he doesn’t think his view entails it.
I wonder whether John believes that well-liked research, e.g. Fabien’s list, is actually not valuable or rare exceptions coming from a small subset of the “alignment research” field.
xAI Risk Management Framework (Draft)
You’re mostly right about evals/thresholds. Mea culpa. Sorry for my sloppiness.
For misuse, xAI has benchmarks and thresholds (or rather examples of benchmarks and thresholds to appear in the real future framework), and based on the right column they seem very reasonably low.
Unlike other similar documents, these are not thresholds at which to implement mitigations but rather thresholds to reduce performance to. So it seems the primary concern is probably not that the thresholds are too high but rather that xAI’s mitigations won’t be robust to jailbreaks and xAI won’t elicit performance on post-mitigation models well. E.g. it would be inadequate to just run a benchmark with a refusal-trained model, note that it almost always refuses, and call it a success. You need something like: a capable red-team tries to break the mitigations and use the model for harm, and either the red-team fails or it’s so costly that the model doesn’t make doing harm cheaper.
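To make the red-team criterion concrete, here is a rough sketch of the decision rule. `RedTeamResult`, `mitigations_adequate`, and the cost units are hypothetical; nothing like this appears in xAI's draft, and actually estimating these costs is the hard part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RedTeamResult:
    succeeded: bool                      # did the red team get harmful uplift out of the post-mitigation model?
    attack_cost: Optional[float] = None  # cost of doing harm via the model, including jailbreaking effort (hypothetical units)


def mitigations_adequate(result: RedTeamResult, baseline_cost: float) -> bool:
    """Adequate if the red team fails, or if using the model is no cheaper
    than doing the same harm without the AI (baseline_cost)."""
    if not result.succeeded:
        return True
    if result.attack_cost is None:
        # Success with unknown cost: treat conservatively as inadequate.
        return False
    return result.attack_cost >= baseline_cost


failed = mitigations_adequate(RedTeamResult(succeeded=False), baseline_cost=100.0)
cheap_break = mitigations_adequate(RedTeamResult(True, attack_cost=10.0), baseline_cost=100.0)
costly_break = mitigations_adequate(RedTeamResult(True, attack_cost=150.0), baseline_cost=100.0)
```

Note that a high refusal rate on a benchmark never enters this rule; only what a capable attacker can actually get out of the model does.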
(For “Loss of Control,” one of the two cited benchmarks was published today—I’m dubious that it measures what we care about but I’ve only spent ~3 mins engaging—and one has not yet been published. [Edit: and, like, on priors, I’m very skeptical of alignment evals/metrics, given the prospect of deceptive alignment, how we care about worst-case in addition to average-case behavior, etc.])