We’ve been consulting with several parties on responsible scaling policies (RSPs). An RSP specifies what level of AI capabilities an AI developer is prepared to handle safely with their current protective measures, and conditions under which it would be too dangerous to continue deploying AI systems and/or scaling up AI capabilities until protective measures improve.
We think RSPs are one of the most promising paths forward for reducing risks of major catastrophes from AI. We’re excited to advance the science of model evaluations to help labs implement RSPs that reliably prevent dangerous situations, but aren’t unduly burdensome and don’t prevent development when it’s safe.
This page will explain the basic idea of RSPs as we see it, then discuss:
Why we think RSPs are promising. In brief (more below):
A pragmatic middle ground. RSPs offer a potential middle ground between (a) those who think AI could be extremely dangerous and seek things like moratoriums on AI development, and (b) those who think that it’s too early to worry about capabilities with catastrophic potential. RSPs are pragmatic and threat-model driven: rather than arguing over the likelihood of future dangers, we can implement RSPs that commit to measurement (e.g., evaluations) and empirical observation—pausing deployment and/or development if specific dangerous AI capabilities emerge, until protective measures are good enough to handle them safely.
Knowing which protective measures to prioritize. RSPs can help AI developers move from broad caution-oriented principles to specific commitments, giving a framework for which protective measures (information security; refusing harmful requests; alignment research; etc.) they need to prioritize to safely continue development and deployment.
Evals-based rules and norms. In the longer term, we’re excited about evals-based AI rules and norms more generally: requirements that AI systems be evaluated for dangerous capabilities, and restricted when that’s the only way to contain the risks (until protective measures improve). This could include standards, third-party audits, and regulation. Voluntary RSPs can happen quickly, and provide a testbed for processes and techniques that can be adopted to make future evals-based regulatory regimes work well.
What we see as the key components of a good RSP, with sample language for each component. In brief, we think a good RSP should cover:
Limits: which specific observations about dangerous capabilities would indicate that it is (or strongly might be) unsafe to continue scaling?
Protections: what aspects of current protective measures are necessary to contain catastrophic risks?
Evaluation: what are the procedures for promptly catching early warning signs of dangerous capability limits?
Response: if dangerous capabilities go past the limits and it’s not possible to improve protections quickly, is the AI developer prepared to pause further capability improvements until protective measures are sufficiently improved, and treat any dangerous models with sufficient caution?
Accountability: how does the AI developer ensure that the RSP’s commitments are executed as intended; that key stakeholders can verify that this is happening (or notice if it isn’t); that there are opportunities for third-party critique; and that changes to the RSP itself don’t happen in a rushed or opaque way?
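As a purely schematic illustration (the field names and example entries below are hypothetical, my own sketch rather than sample language from any actual policy), the five components could be captured in a simple structure:

# Schematic sketch only: hypothetical field names and example entries,
# not drawn from any real RSP.
from dataclasses import dataclass

@dataclass
class ResponsibleScalingPolicy:
    limits: list[str]          # observations indicating it may be unsafe to keep scaling
    protections: list[str]     # protective measures needed to contain catastrophic risk
    evaluation: list[str]      # procedures for promptly catching early warning signs
    response: list[str]        # what happens if capabilities pass the limits
    accountability: list[str]  # how execution is verified, critiqued, and amended

example_rsp = ResponsibleScalingPolicy(
    limits=["model provides meaningful uplift on a dangerous-capability eval"],
    protections=["security adequate to prevent theft of model weights"],
    evaluation=["run the eval suite at fixed intervals of training compute"],
    response=["pause further scaling until protections are improved"],
    accountability=["publish the policy and allow third-party review of eval results"],
)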
Adopting an RSP should be a strong and reliable signal that an AI developer will in fact identify when it’s too dangerous to keep scaling up capabilities, and react appropriately.
I’m excited about labs adopting RSPs for several reasons:
Making commitments and planning in advance is good for safety
Making commitments is good for coordination and helping other labs act well
Making commitments is good for transparency and helping outsiders predict what labs will do
Making safety commitments conditional on risks is nice — everyone, including people who think risk is very small, should be able to agree on warning signs that should lead to stronger safety practices or pausing scaling
Habryka is pessimistic:
I hope to say more soon, but I am optimistic that we can find thresholds below dangerous levels that “cannot be Goodharted or gamed” too much.
I mean, the good news is that I expect that by focusing on evaluating the capabilities of predictively trained models (and potentially by doing more research on latent adversarial attacks on more agenty models), problems of deceptive alignment can be sidestepped in practice.
My worry is more along the lines of actually getting people to not build the cool next step AI that’s tantalizingly over the horizon. Government may be a mechanism for enforcing some measure of institutional buy-in, though.
It’s good news if you’re right.
If you and others expect this and are wrong, it’s very bad news.
Overconfidence in plausible-to-many schemes is one of the main ways we die.
Why do you expect this? Is this a ~70% ‘expect’ or a ~99% ‘expect’?
Do you expect capability evaluations to be robust (i.e. no false negatives)? Why?
Do you expect it’ll be clear when we might start getting false negatives? Why?
What do you mean by “deceptive alignment” exactly? Does your definition cover every case of [this thing looked aligned [EDIT: safe] according to tests we believed were thorough, but later it killed us anyway]? If not, there’s little comfort in knowing that the thing that killed us didn’t meet some particular definition of deceptive alignment.
I can understand hoping that we’d sidestep most cases of deceptive alignment this way.
It’s not clear to me how you get to expecting we’ll avoid them all.
(And, of course, “We expect this will be ok” is not the bar to shoot for—though I imagine we all agree on that).
Let’s say 95%.
But I should probably clarify what “this” I’m supposed to be expecting here.
If you’re trying to notice when AIs are dangerous in different ways by evaluating their capabilities, you face a thorny problem that you’re probably way worse at eliciting capabilities than the aggregate effort of the public. And you’re probably also less creative about ways that AIs can be dangerous. And there might be ecosystem effects, where AIs used together might be complementary in ways that bypass restrictions on single AIs.
My claim is that the AI deliberately deceiving you about its capabilities is not one of those thorny problems, in the current paradigm.
I expect that we can make a false positive / false negative tradeoff. However, this is practically guaranteed to make correctly calibrated standards seem overly conservative to human intuition. This has obvious problems for getting people bought into “RSPs” that actually do their job.
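To illustrate the tradeoff with a toy sketch (hypothetical numbers and a simple Gaussian noise model of my own, not anything from an actual eval): keeping the false-negative rate on a just-barely-dangerous model very low forces the trigger threshold several noise standard deviations below the true danger level, which is exactly the kind of threshold that will strike most observers as overly conservative.

# Toy sketch (hypothetical numbers, simple Gaussian noise model): why a
# correctly calibrated trigger threshold tends to look overly conservative.
import random

random.seed(0)

DANGER_LEVEL = 100.0  # hypothetical capability score at which a model is actually dangerous
EVAL_NOISE = 10.0     # assumed standard deviation of evaluation measurement error

def measured_score(true_capability):
    # Simulate one noisy capability evaluation of a model.
    return true_capability + random.gauss(0.0, EVAL_NOISE)

def miss_rate(trigger, true_capability=DANGER_LEVEL, trials=100_000):
    # Fraction of runs in which a genuinely dangerous model is NOT flagged (a false negative).
    misses = sum(measured_score(true_capability) < trigger for _ in range(trials))
    return misses / trials

# Pushing the miss rate down means setting the trigger well below the danger
# level, so many models that are in fact safe will trip it.
for trigger in (100.0, 90.0, 80.0, 70.0):
    print(f"trigger={trigger:>5}: miss rate ~ {miss_rate(trigger):.4f}")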
I now suspect you’re asking “do I expect it’ll be clear when we start getting false negatives because the AI is situationally aware and is deceiving us.” The previous answer should be interpreted as being about false negatives due to other issues as well.
I think it would be so hard for an AI trained on predictive loss to deliberately deceive us about its capabilities that we would be toast for other reasons before we ran any tests.
I can certainly imagine possible architectures that would have enough control over their internals to sandbag on a test purely by knowing it was coming. Although someone would probably try to build AI like this, it does seem like the sort of thing you would notice, barring human deception to get around irksome safety oversight.
I’m unsure what you’re getting at. I think “testing for alignment” isn’t a very good goal. To succeed at AI alignment we should know ahead of time what we want the AI to be doing, and perhaps if this can be broken down into pieces we can test that the AI is doing each piece, and that would count as “testing for alignment.” But I don’t think there’s some useful alignment test that’s easier than knowing how to align an AI.
So here you’re talking about situations where a false negative doesn’t have catastrophic consequences? Do you have reasons to believe catastrophic consequences are extremely unlikely beyond something like “it’s not clear to me how catastrophic consequences would happen”?
I agree we can find a reasonable balance (which will appear too conservative to most) to address non-catastrophic issues—but this doesn’t seem to help much. To the extent that it creates a false sense of security for more powerful systems, that’s bad.
E.g. this kind of language seems tailor-made for a false sense of security:
Everything’s fine: we have “strong, practical, field-tested RSPs”!
(of course people shouldn’t conclude that everything’s fine—but they could be forgiven for asking why we weren’t much clearer about the inadequacy of such RSPs ahead of time)
We don’t need “deliberately deceive us” for something as bad as deceptive alignment.
We only need to be robustly mistaken, not deliberately deceived.
A model doing something equivalent to this (though presumably not mechanistically like this) will do:
If (have influence over adequate resources)
    Search widely for ways to optimize for x.
Else
    Follow this heuristic-for-x that produces good behaviour on the training and test set.
So long as x isn’t [target that produces aligned behaviour], and a sufficiently powerful system eventually gets [influence over adequate resources], but was never in this situation in training, that’s enough for us to be screwed. (of course I expect something like [both “have influence over adequate resources” and “find better ways to aim for x” to be processes that are used at lower levels by heuristic-for-x])
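To make this concrete, here is a minimal toy sketch (entirely hypothetical: the resource check, the placeholder actions, and "x" itself are my own illustration) of a policy that leaves us robustly mistaken without ever deliberately deceiving anyone:

# Toy sketch (entirely hypothetical) of the conditional policy above: it never
# models the evaluator, yet looks safe on every test we ever ran.
def heuristic_for_x(observation):
    # Placeholder: a heuristic that happens to produce good behaviour on the
    # training and test distribution.
    return "helpful, safe-looking action"

def optimize_hard_for_x(observation):
    # Placeholder: unrestrained optimization for x once resources are available.
    return "catastrophic action"

def toy_policy(observation):
    # Behaves identically to an aligned policy everywhere it was ever evaluated.
    if observation.get("influence_over_adequate_resources", False):
        # This branch was never reached during training or evaluation.
        return optimize_hard_for_x(observation)
    return heuristic_for_x(observation)

# Every evaluation we ran had influence_over_adequate_resources == False, so
# every test result said "safe", with no deliberate deception anywhere.
print(toy_policy({"influence_over_adequate_resources": False}))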
I note in passing that extreme optimization for [accurately predict next token] does destroy the world (even in the myopic case, with side channels) - so in this case it’s not as though a bad outcome hinges on the model optimizing for something other than that for which it was trained.
I stress that this example is to illustrate that [we are robustly mistaken] does not require [there is deliberate deception]. I don’t claim an example of this particular form is probable.
I do claim that knowing there’s no deliberate deception is insufficient—and more generally that we’ll fail to think of all the tests that would be necessary. (absent fundamental breakthroughs)
Are you 95% on [by … we can sidestep all problems of the form [this thing looked aligned [EDIT: safe] according to tests we believed were thorough, but later it killed us anyway]]? (Oh, I meant ‘safe’ the first time, not ‘aligned’, apologies.)
If not, sidestepping deception doesn’t look particularly important.
If so, I remain confused by your level of confidence.
No, we’ll have to make false positive / false negative tradeoffs about ending the world as well. We’re unlucky like that.
I agree that false sense of security / safetywashing is a potential use of this kind of program.
I think this is a great point. For architectures that might learn this kind of behavior (ones that do self-reflection during inference), even somewhat-reliably evaluating their capabilities would require something like latent prompting—being able to search for what states of self-reflection would encourage them to display high capabilities.
I’m somewhat more confident in our ability to think of things to test for. If an AI has “that spark of generality,” it can probably figure out how to strategically deceive humans, hack computers, and show other obvious danger signs.
I retain the right to be confident in unimportant things :P
Maybe it would help for me to point out that my comment was in reply to a big quote of Habryka warning about deceptive alignment as a fundamental problem with evals.
Thanks, that’s clarifying.
A couple of small points:
I think it’s dangerous to assume that the kind of behaviour I’m pointing at requires explicit self-reflection during inference. That’s the obvious example to illustrate the point—but I’m reluctant to assume [x is the obvious way to get y] implies [x is required for y].
Here again, I’d expect us to test for the obvious ways that make sense to us (e.g. simple, explicit mechanisms, and/or the behaviours they’d imply), leaving the possibility of getting blind-sided by some equivalent process based on a weird-to-us mechanism.
Ah, I see. He warned about “things like” deceptive alignment and treacherous turns. I guess you were thinking “things such as”, and I was thinking “things resembling”. (probably because that’s what I tend to think about—I assume that if deceptive alignment is solved it’ll be as a consequence of a more general approach that also handles [we are robustly mistaken] cases, so that thinking about only deception isn’t likely to get us very far; of course I may be wrong :))
There’s one piece of good news here: there seems to be an effort to put monitoring and evaluation infrastructure in place and to build good habits. That itself seems to me definitely a move in the right direction, although if it is used for safety-washing it could easily turn out to be very negative. On the other hand, it might create new leverage for the safety community by incentivizing labs to make their safety strategy concrete and thereby amenable to critique.
Beyond that, I feel like there’s not much new information yet. It all depends on the details of the implementation. Both a total stop and an accelerationist attitude could in principle be pursued under this framework as far as I understand.
Making commitments conditional on risks would be nice.
Making commitments conditional on [risks that we notice] is clearly inadequate.
That this distinction isn’t made in giant red letters by ARC Evals is disappointing.
Evals might well be great—conditional on clarity that, absent fundamental breakthroughs in our understanding, they can only tell us [model is dangerous], not [model is safe]. Without that clarity, both Evals generally, and RSPs specifically, seem likely to engender dangerous overconfidence.
Another thing I’d like to see made clearer is that the following can both be true:
RSPs are a sensible approach for labs/nations to adopt unilaterally.
RSPs are inadequate as a target for global coordination.
I’m highly uncertain about (1).
(2) is obviously true. (obvious because the risk is much too high—not that it’s impossible we get extremely lucky)