RSPs are pauses done right
COI: I am a research scientist at Anthropic, where I work on model organisms of misalignment; I was also involved in the drafting process for Anthropic’s RSP. Prior to joining Anthropic, I was a Research Fellow at MIRI for three years.
Thanks to Kate Woolverton, Carson Denison, and Nicholas Schiefer for useful feedback on this post.
Recently, there’s been a lot of discussion and advocacy around AI pauses—which, to be clear, I think is great: pause advocacy pushes in the right direction and works to build a good base of public support for x-risk-relevant regulation. Unfortunately, at least in its current form, pause advocacy seems to lack any sort of coherent policy position. Furthermore, what’s especially unfortunate about pause advocacy’s nebulousness—at least in my view—is that there is a very concrete policy proposal out there right now that I think is basically necessary as a first step here, which is the enactment of good Responsible Scaling Policies (RSPs). And RSPs could very much live or die right now based on public support.
If you’re not familiar with the concept, the central idea of an RSP is evaluation-gated scaling—that is, AI labs may only continue scaling their models if some set of evaluations determines that additional scaling is appropriate. ARC’s definition is:
An RSP specifies what level of AI capabilities an AI developer is prepared to handle safely with their current protective measures, and conditions under which it would be too dangerous to continue deploying AI systems and/or scaling up AI capabilities until protective measures improve.
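To make the gating structure concrete, here is a minimal sketch of evaluation-gated scaling as a decision rule. This is purely illustrative: the `EvalResults` fields and the function are my own hypothetical framing, not part of ARC’s definition or any lab’s actual policy.

```python
from dataclasses import dataclass

@dataclass
class EvalResults:
    # Capabilities eval: could the model do serious harm if it were trying to
    # (e.g. autonomously replicate)? These can be made reliable with
    # fine-tuning and careful engineering.
    dangerous_capabilities: bool
    # Safety eval: can we demonstrate the model won't actually try?
    # For takeover-capable models, we don't yet know how to do this.
    safety_demonstrated: bool
    # Security eval: are protective measures (e.g. against weight theft)
    # adequate for this capability level?
    security_adequate: bool

def may_continue_scaling(r: EvalResults) -> bool:
    """Hypothetical RSP gate: scaling proceeds freely below the capability
    threshold; above it, scaling is gated on safety and security targets."""
    if not r.dangerous_capabilities:
        return True
    return r.safety_demonstrated and r.security_adequate
```

The key structural point this sketch captures is that almost all of the early work is done by the capabilities check, with the safety and security conditions only binding once models cross a dangerous-capability threshold.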
How do we make it to a state where AI goes well?
I want to start by taking a step back and laying out a concrete plan for how we get from where we are right now to a policy regime that is sufficient to prevent AI existential risk.
The most important background here is my “When can we trust model evaluations?” post, since knowing the answer to when we can trust evaluations is extremely important for setting up any sort of evaluation-gated scaling. The TL;DR there is that it depends heavily on the type of evaluation:
A capabilities evaluation is defined as “a model evaluation designed to test whether a model could do some task if it were trying to. For example: if the model were actively trying to autonomously replicate, would it be capable of doing so?”
With the use of fine-tuning, and a bunch of careful engineering work, capabilities evaluations can be done reliably and robustly.
A safety evaluation is defined as “a model evaluation designed to test under what circumstances a model would actually try to do some task. For example: would a model ever try to convince humans not to shut it down?”
Currently, we do not yet know how to do robust and reliable safety evaluations. This will likely require developing understanding-based safety evaluations.
With that as background, here’s a broad picture of how things could go well via RSPs (note that everything here is just one particular story of success, not necessarily the only story of success we should pursue or a story that I expect to actually happen by default in the real world):
AI labs put out RSP commitments to stop scaling when particular capabilities benchmarks are hit, resuming only when they are able to hit particular safety/alignment/security targets.
Early on, as models are not too powerful, almost all of the work is being done by capabilities evaluations that determine that the model isn’t capable of e.g. takeover. The safety evaluations are mostly around security and misuse risks.
For later capabilities levels, however, it is explicit in all RSPs that we do not yet know what safety metrics could demonstrate safety for a model that might be capable of takeover.
Seeing the existing RSP system in place at labs, governments step in and use it as a basis to enact hard regulation.
By the time it is necessary to codify exactly what safety metrics are required for scaling past models that pose a potential takeover risk, we have clearly solved the problem of understanding-based evals and know what it would take to demonstrate sufficient understanding of a model to rule out e.g. deceptive alignment.
Understanding-based evals are adopted by governmental RSP regimes as hard gating evaluations for models that pose a potential takeover risk.
Once labs start to reach models that pose a potential takeover risk, they either:
Solve mechanistic interpretability to a sufficient extent that they are able to pass an understanding-based eval and demonstrate that their models are safe.
Get blocked on scaling until mechanistic interpretability is solved, forcing a reroute of resources from scaling to interpretability.
Reasons to like RSPs
Obviously, the above is only one particular story for how things go well, but I think it’s a pretty solid one. Here are some reasons to like it:
It provides very clear and concrete policy proposals that could realistically be adopted by labs and governments (in fact, step 1 has already started!). Labs and governments don’t know how to respond to nebulous pause advocacy because it isn’t clearly asking for any particular policy (since nobody actually likes and is advocating for the six month pause proposal).
It provides early wins that we can build on later in the form of initial RSP commitments with explicit holes in them. From “AI coordination needs clear wins”:
“In the theory of political capital, it is a fairly well-established fact that ‘Everybody Loves a Winner.’ That is: the more you succeed at leveraging your influence to get things done, the more influence you get in return. This phenomenon is most thoroughly studied in the context of the ability of U.S. presidents to get their agendas through Congress—contrary to a naive model that might predict that legislative success uses up a president’s influence, what is actually found is the opposite: legislative success engenders future legislative success, greater presidential approval, and long-term gains for the president’s party.
I think many people who think about the mechanics of leveraging influence don’t really understand this phenomenon and conceptualize their influence as a finite resource to be saved up over time so it can all be spent down when it matters most. But I think that is just not how it works: if people see you successfully leveraging influence to change things, you become seen as a person who has influence, has the ability to change things, can get things done, etc. in a way that gives you more influence in the future, not less.”
One of the best, most historically effective ways to shape governmental regulation is to start with voluntary commitments. Governments are very good at solving “80% of the players have committed to safety standards but the remaining 20% are charging ahead recklessly” because the solution in that case is obvious and straightforward.
Though we could try to go to governments first rather than labs first, so far I’ve seen a lot more progress with the labs-first approach—though there’s no reason we can’t continue to pursue both in parallel.
RSPs are clearly and legibly risk-based: they specifically kick in only when models have capabilities that are relevant to downstream risks. That’s important because it gives the proposal substantial additional seriousness, since it can point directly to clear harms that it is targeted at preventing.
Additionally, from an x-risk perspective, I don’t even think it actually matters that much what the capability evaluations are here: most potentially dangerous capabilities should be highly correlated, such that measuring any of them should be okay. Thus, I think it should be fine to mostly focus on measuring the capabilities that are most salient to policymakers and most clearly demonstrate risks. And we can directly test the extent to which relevant capabilities are correlated: if they aren’t, we can change course.
Since the strictest conditions of the RSPs only come into effect for future, more powerful models, it’s easier to get people to commit to them now. Labs and governments are generally much more willing to sacrifice potential future value than realized present value.
Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be as at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. There is still a capabilities benchmark below which open-source is fine (though it should be a lower threshold than closed-source, since there are e.g. misuse risks that are much more pronounced for open-source), and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.
Using understanding of models as the final hard gate is a condition that—if implemented correctly—is intuitively compelling and actually the thing we need to ensure safety. As I’ve said before, “the only worlds I can imagine myself actually feeling good about humanity’s chances are ones in which we have powerful transparency and interpretability tools that lend us insight into what our models are doing as we are training them.”
How do RSPs relate to pauses and pause advocacy?
In my opinion, RSPs are pauses done right: if you are advocating for a pause, then presumably you have some resumption condition in mind that determines when the pause would end. In that case, just advocate for that condition being baked into RSPs! And if you have no resumption condition—you want a stop rather than a pause—I empathize with that position but I don’t think it’s (yet) realistic. As I discussed above, it requires labs and governments to sacrifice too much present value (rather than just potential future value), isn’t legibly risk-based, doesn’t provide early wins, etc. Furthermore, I think the best way to actually make a full stop happen is still going to look like my story above, just with RSP thresholds that are essentially impossible to meet.
Furthermore, I want to be very clear that I don’t mean “stop pestering governments and focus on labs instead”—we should absolutely try to get governments to adopt RSP-like policies, and to get conditions that are as strong as possible into any RSP-like policies they adopt. What separates pause advocacy from RSP advocacy isn’t who it’s targeted at, but the concreteness of the policy recommendations it’s advocating for. The point is that advocating for a “pause” is nebulous and non-actionable—“enact an RSP” is concrete and actionable. Advocating for labs and governments to enact the strongest RSPs possible is a much more effective way to actually produce concrete change than highly nebulous pause advocacy.
Furthermore, RSP advocacy is going to be really important! I’m very worried that we could fail at any of the steps above, and advocacy could help substantially. In particular:
We need to actually get as many labs as possible to put out RSPs.
Currently, only Anthropic has done so, but I have heard positive signals from other labs and I think with sufficient pressure they might be willing to put out their own RSPs as well.
We need to make sure that those RSPs actually commit to the right things. What I’m looking for are:
Fine-tuning-based capabilities evaluations being used for below-takeover-potential models.
Evidence that capabilities evaluations will be done effectively and won’t be sandbagged (e.g. committing to use an external auditor).
An explicitly empty hole for safety evaluations for takeover-risk models that can be filled in later by progress on understanding-based evals.
We need to get governments to enact mandatory RSPs for all AI labs.
And these RSPs also need to have all the same important properties as the labs’ RSPs. Ideally, we should get the governmental RSPs to be even stronger!
We need to make sure that, once we have solid understanding-based evals, governments make them mandatory.
I’m especially worried about this point, though I don’t think it’s that hard of a sell: the idea that you should understand what your AI is doing on a deep level is a pretty intuitive one.
Thanks for writing this, Evan! I think it’s the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.
I plan to write up more opinions about RSPs, but one I’ll express for now is that I’m pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I’ll detail this below:
What would a good RSP look like?
Clear commitments along the lines of “we promise to run these 5 specific tests to evaluate these 10 specific dangerous capabilities.”
Clear commitments regarding what happens if the evals go off (e.g., “if a model scores above a 20 on the Hubinger Deception Screener, we will stop scaling until it has scored below a 10 on the relatively conservative Smith Deception Test.”)
Clear commitments regarding the safeguards that will be used once evals go off (e.g., “if a model scores above a 20 on the Cotra Situational Awareness Screener, we will use XYZ methods and we believe they will be successful for ABC reasons.”)
Clear evidence that these evals will exist, will likely work, and will be conservative enough to prevent catastrophe
Some way of handling race dynamics (such that Bad Guy can’t just be like “haha, cute that you guys are doing RSPs. We’re either not going to engage with your silly RSPs at all, or we’re gonna publish our own RSP but it’s gonna be super watered down and vague”).
What do RSPs actually look like right now?
Fairly vague commitments, more along the lines of “we will improve our information security and we promise to have good safety techniques. But we don’t really know what those look like.”
Unclear commitments regarding what happens if evals go off (let alone what evals will even be developed and what they’ll look like). Very much a “trust us; we promise we will be safe. For misuse, we’ll figure out some way of making sure there are no jailbreaks, even though we haven’t been able to do that before.”
Also, for accident risks/AI takeover risks… well, we’re going to call those “ASL-4 systems”. Our current plan for ASL-4 is “we don’t really know what to do… please trust us to figure it out later. Maybe we’ll figure it out in time, maybe not. But in the meantime, please let us keep scaling.”
Extremely high uncertainty about what safeguards will be sufficient. The plan essentially seems to be “as we get closer to highly dangerous systems, we will hopefully figure something out.”
No strong evidence that these evals will exist in time or work well. The science of evaluations is extremely young, and the current evals are more like “let’s play around and see what things can do” rather than “we have solid tests and some consensus around how to interpret them.”
No way of handling race dynamics absent government intervention. In fact, companies are allowed to break their voluntary commitments if they’re afraid that they’re going to lose the race to a less safety-conscious competitor. (This is explicitly endorsed in ARC’s post and Anthropic includes such a clause.)
Important note: I think several of these limitations are inherent to the current gameboard. Like, I’m not saying “I think it’s a bad move for Anthropic to admit that they’ll have to break their RSP if some Bad Actor is about to cause a catastrophe.” That seems like the right call. I’m also not saying that dangerous capability evals are bad—I think it’s a good bet for some people to be developing them.
Why I’m disappointed with current comms around RSPs
Instead, my central disappointment comes from how RSPs are being communicated. It seems to me like the main three RSP posts (ARC’s, Anthropic’s, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs. I don’t expect policymakers who engage with the public comms to walk away with an appreciation for the limitations of RSPs, their current level of vagueness + “we’ll figure things out later”ness, etc.
On top of that, the posts seem to have this “don’t listen to the people who are pushing for stronger asks like moratoriums—instead please let us keep scaling and trust industry to find the pragmatic middle ground” vibe. To me, this seems not only counterproductive but also unnecessarily adversarial. I would be more sympathetic to the RSP approach if it were more like “well yes, we totally think it’d be great to have a moratorium or a global compute cap or a kill switch or a federal agency monitoring risks or a licensing regime, and we also think this RSP thing might be kinda nice in the meantime.” Instead, ARC implies that the moratorium folks are unrealistic, and tries to say they operate on an extreme end of the spectrum, on the opposite side of those who believe it’s too soon to worry about catastrophes whatsoever.
(There’s also an underlying thing here where I’m like “the odds of achieving a moratorium, or a licensing regime, or hardware monitoring, or an agency that monitors risks and has emergency powers—the odds of meaningful policy getting implemented are not independent of our actions.” The more that groups like Anthropic and ARC claim “oh that’s not realistic”, the less realistic those proposals are. I think people are also wildly underestimating the degree to which Overton Windows can change and the amount of uncertainty there currently is among policymakers, but this is a post for another day, perhaps.)
I’ll conclude by noting that some people have gone as far as to say that RSPs are intentionally trying to dilute the policy conversation. I’m not yet convinced this is the case, and I really hope it’s not. But I’d really like to see more coming out of ARC, Anthropic, and other RSP-supporters to earn the trust of people who are (IMO reasonably) suspicious when scaling labs come out and say “hey, you know what the policy response should be? Let us keep scaling, and trust us to figure it out over time, but we’ll brand it as this nice catchy thing called Responsible Scaling.”
Strongly agree with almost all of this.
My main disagreement is that I don’t think the “What would a good RSP look like?” description is sufficient without explicit conditions beyond evals. In particular, I think we should expect that our suite of tests will be insufficient at some point, absent hugely improved understanding—and that we shouldn’t expect to understand how and why it’s insufficient before reality punches us in the face.
Therefore, it’s not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended].
We also need strong evidence that there’ll be no catastrophe-inducing problems we didn’t think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none)
This can’t be implicit, since it’s a central way that we die.
If it’s hard or impractical to estimate, then we should pause until we can estimate it more accurately.
This is the kind of thing that I expect to be omitted from RSPs as a matter of course, precisely because we lack the understanding to create good models/estimates, legible tests etc. That doesn’t make it ok. Blank map is not blank territory.
If we’re thinking of better mechanisms to achieve a pause, I’d add:
Call it something like a “Responsible training and deployment policy (RTDP)”, not an RSP. Scaling is the thing in question. We should remove it from the title if we want to give the impression that it might not happen. (compare “Responsible farming policy”, “Responsible fishing policy”, “Responsible diving policy”—all strongly imply that responsible x-ing is possible, and that x-ing will continue to happen subject to various constraints)
Don’t look for a ‘practical’ solution. A serious pause/stop will obviously be impractical (yet not impossible). To restrict ourselves to practical approaches is to give up on any meaningful pause. Doing the impractical is not going to get easier later.
Define ASLs or similar now rather than waiting until we’re much closer to achieving them. Waiting to define them later gives the strong impression that the approach is [pick the strongest ASL definitions and measures that will be achievable so that we can keep scaling] and not [pick ASL definitions and measures that are clearly sufficient for safety].
Evan’s own “We need to make sure that, once we have solid understanding-based evals, governments make them mandatory” only reinforces this impression. Whether we have them is irrelevant to the question of whether they’re necessary.
Be clear and explicit about the potential for very long pauses, and the conditions that would lead to them. Where it’s hard to give precise conditions, give high-level conditions and very conservative concrete defaults (not [reasonably conservative]; [unreasonably conservative]). Have a policy where a compelling, externally reviewed argument is necessary before any conservative default can be relaxed.
I think there’s a big danger in safety people getting something in place that we think/hope will imply a later pause, only to find that when it really counts the labs decide not to interpret things that way and to press forward anyway—with government/regulator backing, since they’re doing everything practical, everything reasonable....
Assuming this won’t happen seems dangerously naive.
If labs are going to re-interpret the goalposts and continue running into the minefield, we need to know this as soon as possible. This requires explicit clarity over what is being asked / suggested / eventually-entailed.
The Anthropic RSP fails at this IMO: no understanding-based requirements; no explicit mention that pausing for years may be necessary.
The ARC Evals RSP description similarly fails—if RSPs are intended to be a path to pausing. “Practical middle ground” amounts to never realistically pausing. They entirely overlook overconfidence as a problem. (frankly, I find this confusing coming from Beth/Paul et al)
Have separate RTDPs for unilateral adoption, and adoption subject to multi-lab agreement / international agreement etc. (I expect at least three levels would make sense)
This is a natural way to communicate “We’d ideally like [very strict measures], though [less strict measures] are all we can commit to unilaterally”.
If a lab’s unilateral RTDP looks identical to their [conditional on international agreement] RTDP, then they have screwed up.
Strongly consider pushing for safety leads to write and sign the RTDP (with help, obviously). I don’t want the people who know most about safety to be “involved in the drafting process”; I want to know that they oversaw the process and stand by the final version.
I’m sure there are other sensible additions, but that’d be a decent start.
Yeah, I agree—that’s why I’m specifically optimistic about understanding-based evals, since I think they actually have the potential to force us to catch unknown unknowns here, the idea being that they require you to prove that you actually understand your model to a level where you’d know if there were anything wrong that your other evals might miss.
See the bottom of this comment: my main objection here is that if we were to try to define it now, we’d end up defining something easily game-able because we don’t yet have metrics for understanding that aren’t easily game-able. So if we want something that will actually be robust, we have to wait until we know what that something might be—and ideally be very explicit that we don’t yet know what we could put there.
I definitely agree that this is a serious concern! That’s part of why I’m writing this post: I want more public scrutiny and pressure on RSPs and their implementation to try to prevent this sort of thing.
IANAL, but I think that this is currently impossible due to anti-trust regulations. The White House would need to enact a safe harbor policy for anti-trust considerations in the context of AI safety to make this possible.
I don’t know anything about anti-trust enforcement, but it seems to me that this might be a case where labs should do it anyways & delay hypothetical anti-trust enforcement by fighting in court.
I happen to think that the Anthropic RSP is fine for what it is, but it just doesn’t actually make any interesting claims yet. The key thing is that they’re committing to actually having an ASL-4 criteria and safety argument in the future. From my perspective, the Anthropic RSP effectively is an outline for the sort of thing an RSP could be (run evals, have safety buffer, assume continuity, etc) as well as a commitment to finish the key parts of the RSP later. This seems ok to me.
I would have preferred if they had included tentative proposals for ASL-4 evaluations and what their current best safety plan/argument for ASL-4 looks like (using just current science, no magic). Then, explain that the plan wouldn’t be sufficient for reasonable amounts of safety (insofar as this is what they think).
Right now, they just have a bulleted list for ASL-4 countermeasures, but this is the main interesting thing to me. (I’m not really sold on substantial risk from systems which aren’t capable of carrying out that harm mostly autonomously, so I don’t think ASL-3 is actually important except as setup.)
Yeah, of course this would be nice. But the reason that ARC and Anthropic didn’t write this ‘good RSP’ isn’t that they’re reckless, but because writing such an RSP is a hard open problem. It would be great to have “specific tests” for various dangerous capabilities, or “Some way of handling race dynamics,” but nobody knows what those are.
Of course the specific object-level commitments Anthropic has made so far are insufficient. (Fortunately, they committed to make more specific object-level commitments before reaching ASL-3, and ASL-3 is reasonably well-specified [edit: and almost certainly below x-catastrophe-level].) I praise Anthropic’s RSP and disagree with your vibe because I don’t think you or I or anyone else could write much better commitments. (If you have specific commitments-labs-should-make in mind, please share them!)
(Insofar as you’re just worried about comms and what-people-think-about-RSPs rather than how-good-RSPs-are, I’m agnostic.)
Thanks, Zach! Responses below:
I agree that writing a good RSP is a hard open problem. I don’t blame ARC for not having solved the “how can we scale safely” problem. I am disappointed in ARC for communicating about this poorly (in their public blog post, and [speculative/rumor-based] maybe in their private government advocacy as well).
I’m mostly worried about the comms/advocacy/policy implications. I would have felt much better if Anthropic and ARC had come out and said “look, we have some ideas, but the field really isn’t mature enough and we really don’t know what we’re doing, and these voluntary commitments are clearly insufficient. But if you really had to ask us for our best guesses RE what to do if there is no government regulation coming, and for some reason we had to keep racing toward god-like AI, here are our best guesses. Please note that this is woefully insufficient and we would strongly prefer government intervention to buy enough time so that we can have actual plans.”
I also expect most of the (positive or negative) impact of the recent RSP posts to come from the comms/advocacy/policy externalities.
I don’t think the question of whether you or I could write better commitments is very relevant. My claim is more like “no one can make a good enough RSP right now, so instead of selling governments on RSPs, we should be communicating clearly that the current race to godlike AI is awful, our AIS ideas are primitive, we might need to pause for decades, and we should start developing the hardware monitoring//risk assessment//emergency powers//kill switch infrastructure//international institutions that we will need.”
But if I had to answer this directly: I actually do think that if I spent 1-2 weeks working on coming up with better commitments, and I could ask for feedback from like 1-3 advisors of my choice, I could probably come up with “better” commitments. I don’t think this is because I’m particularly smart, though– I just think the bar is low. My impression is that the 22-page doc from Anthropic didn’t actually have many commitments.
The main commitments that stood out to me were: (a) run evals [exact evals unspecified] at least every 4X in effective compute, (b) have good infosec before you have models that can enable bioweapons or other Scary Misuse models, and (c) define ASL-4 criteria once you have Scary Misuse models. There are some other more standard/minor things as well like sharing vulnerabilities with other labs, tiered model access, and patching jailbreaks [how? and how much is sufficient?].
The main caveat I’ll add is that “better” is a fuzzy term in this context. Like, I’m guessing a lot of the commitments I’d come up with are things that are more costly from Anthropic’s POV. So maybe many of them would be “worse” in the sense that Anthropic wouldn’t be willing to adopt them, or would argue that other labs are not going to adopt them, therefore they can’t adopt them or else they’ll be less likely to win the race.
I would love to see competing RSPs (or, better yet, RTDPs, as @Joe_Collman pointed out in a cousin comment).
I mean, I am very explicitly trying to communicate what I see as the success story here. I agree that there are many ways that this could fail—I mention a bunch of them in the last section—but I think that having a clear story of how things could go well is important to being able to work to actually achieve that story.
I want to be very clear that I’ve been really happy to see all the people pushing for strong asks here. I think it’s a really valuable thing to be doing, and what I’m trying to do here is not stop that but help it focus on more concrete asks.
To be clear, I definitely agree with this. My position is not “RSPs are all we need”, “pauses are bad”, “pause advocacy is bad”, etc.—my position is that getting good RSPs is an effective way to implement a pause: i.e. “RSPs are pauses done right.”
Some feedback on this: my expectation upon seeing your title was that you would argue, or that you implicitly believe, that RSPs are better than other current “pause” attempts/policies/ideas. I think this expectation came from the common usage of the phrase “done right” to mean that other people are doing it wrong or at least doing it suboptimally.
I mean, to be clear, I am saying something like “RSPs are the most effective way to implement a pause that I know of.” The thing I’m not saying is just that “RSPs are the only policy thing we should be doing.”
This reads as some sort of confused motte and bailey. Are RSPs “an effective way” or “the most effective way… [you] know of”? These are different things, with each being stronger/weaker in different ways. Regardless, the title could still be made much more accurate to your beliefs, e.g. ~’RSPs are our (current) best bet on a pause’. ‘An effective way’ is definitely not “i.e … done right”, but “the most effective way… that I know of” is also not.
I disagree? I think the plain English meaning of the title “RSPs are pauses done right” is precisely “RSPs are the right way to do pauses (that I know of)” which is exactly what I think and exactly what I am defending here. I honestly have no idea what else that title would mean.
Sorry yeah I could have explained what I meant further. The way I see it:
‘X is the most effective way that I know of’ = X tops your ranking of the different ways, but could still be below a minimum threshold (e.g. X doesn’t have to even properly work, it could just be less ineffective than all the rest). So one could imagine someone saying “X is the most effective of all the options I found and it still doesn’t actually do the job!”
‘X is an effective way’ = ‘X works, and it works above a certain threshold’.
‘X is Y done right’ = ‘X works and is basically the only acceptable way to do Y,’ where it’s ambiguous or contextual as to whether ‘acceptable’ means that it at least works, that it’s effective, or something like ‘it’s so clearly the best way that anyone doing the 2nd-best thing is doing something bad’.
Why, then, is “RSPs are the most effective way to implement a pause that I know of” not literally the title of your post?
Are you thinking about this post? I don’t see any explicit claims that the moratorium folks are extreme. What passage are you thinking about?
In terms of explicit claims:
“So one extreme side of the spectrum is build things as fast as possible, release things as much as possible, maximize technological progress [...].
The other extreme position, which I also have some sympathy for, despite it being the absolutely opposite position, is you know, Oh my god this stuff is really scary.
The most extreme version of it was, you know, we should just pause, we should just stop, we should just stop building the technology for, indefinitely, or for some specified period of time. [...] And you know, that extreme position doesn’t make much sense to me either.”
Dario Amodei, Anthropic CEO, explaining his company’s “Responsible Scaling Policy” on the Logan Bartlett Podcast on Oct 6, 2023.
Starts at around 49:40.
This example is not a claim by ARC though, seems important to keep track of this in a discussion of what ARC did or didn’t claim, even as others making such claims is also relevant.
I was thinking about this passage:
I think “extreme” was subjective and imprecise wording on my part, and I appreciate you catching this. I’ve edited the sentence to say “Instead, ARC implies that the moratorium folks are unrealistic, and tries to say they operate on an extreme end of the spectrum, on the opposite side of those who believe it’s too soon to worry about catastrophes whatsoever.”
This is a really important thing to iron out.
Going forward (through the 2020s), it’s really important to avoid underestimating the ratio of money going into facilitating an AI pause vs. money going into subverting or thwarting one. The impression I get is that the vast majority of people are underestimating how much money and talent will end up being allocated toward subverting or thwarting an AI pause—e.g. finding galaxy-brained ways to intimidate or mislead well-intentioned AI safety orgs into self-sabotage (such as opposing policies that are actually feasible, or even mandatory for human survival, like an AI pause), or turning them against each other (which is unambiguously the kind of thing that happens in a world with very high lawyers-per-capita, particularly in issues and industries where lots of money is at stake). False alarms are almost an equally serious issue, because they also severely increase vulnerability, which further incentivizes adverse actions against the AI safety community by outside third parties (e.g. by signalling a high payoff and low risk of detection for any adverse actions).
I’m sympathetic to the idea that it would be good to have concrete criteria for when to stop a pause, were we to start one. But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it.
I’m first going to zoom out a bit—to a broader trend which I’m worried about in AI Safety, and something that I believe evaluation-gating might exacerbate, although it is certainly not the only contributing factor.
I think there is pressure mounting within the field of AI Safety to produce measurables, and to do so quickly, as we continue building towards this godlike power under an unknown timer of unknown length. This is understandable, and I think can often be good, because in order to make decisions it is indeed helpful to know things like “how fast is this actually going” and to assure things like “if a system fails such and such metric, we’ll stop.”
But I worry that in our haste we will end up focusing our efforts under the streetlight. I worry, in other words, that the hard problem of finding robust measurements—those which enable us to predict the behavior and safety of AI systems with anywhere near the level of precision we have when we say “it’s safe for you to get on this plane”—will be set aside in favor of the easier problem of using the measurements we already have, or those which are close by; ones which are at best only proxies and at worst almost completely unrelated to what we ultimately care about.
And I think it is easy to forget, in an environment where we are continually churning out things like evaluations and metrics, how little we in fact know. That when people see a sea of ML papers, conferences, math, numbers, and “such and such system passed such and such safety metric,” that it conveys an inflated sense of our understanding, not only to the public but also to ourselves. I think this sort of dynamic can create a Red Queen’s race of sorts, where the more we demand concrete proposals—in a domain we don’t yet actually understand—the more pressure we’ll feel to appear as if we understand what we’re talking about, even when we don’t. And the more we create this appearance of understanding, the more concrete asks we’ll make of the system, and the more inflated our sense of understanding will grow, and so on.
I’ve seen this sort of dynamic play out in neuroscience, where in my experience the ability to measure anything at all about some phenomenon often leads people to prematurely conclude we understand how it works. For instance, reaction times are a thing one can reliably measure, and so is EEG activity, so people will often do things like… measure both of these quantities while manipulating the number of green blocks on a screen, then call the relationship between these “top-down” or “bottom-up” attention. All of this despite having no idea what attention is, and hence no idea if these measures in fact meaningfully relate much to the thing we actually care about.
There are a truly staggering number of green block-type experiments in the field, proliferating every year, and I think the existence of all this activity (papers, conferences, math, numbers, measurement, etc.) convinces people that something must be happening, that progress must be being made. But if you ask the neuroscientists attending these conferences what attention is, over a beer, they will often confess that we still basically have no idea. And yet they go on, year after year, adding green blocks to screens ad infinitum, because those are the measurements they can produce, the numbers they can write on grant applications, grants which get funded because at least they’re saying something concrete about attention, rather than “I have no idea what this is, but I’d like to figure it out!”
I think this dynamic has significantly corroded academia’s ability to figure out important, true things, and I worry that if we introduce it here, that we will face similar corrosion.
Zooming back in on this proposal in particular: I feel pretty uneasy about the messaging, here. When I hear words like “responsible” and “policy” around a technology which threatens to vanquish all that I know and all that I love, I am expecting things more like “here is a plan that gives us multiple 9’s of confidence that we won’t kill everyone.” I understand that this sort of assurance is unavailable, at present, and I am grateful to Anthropic for sharing their sketches of what they hope for in the absence of such assurances.
But the unavailability of such assurance is also kind of the point, and one that I wish this proposal emphasized more… it seems to me that vague sketches like these ought to be full of disclaimers like, “This is our best idea but it’s still not very reassuring. Please do not believe that we are safely able to prevent you from dying, yet. We have no 9’s to give.” It also seems to me like something called a “responsible scaling plan” should at the very least have a convincing story to tell about how we might get from our current state, with the primitive understanding we have, to the end-goal of possessing the sort of understanding that is capable of steering a godly power the likes of which we have never seen.
And I worry that in the absence of such a story—where the true plan is something closer to “fill in the blanks as we go”—that a mounting pressure to color in such blanks will create a vacuum, and that we will begin to fill it with the appearance of understanding rather than understanding itself; that we will pretend to know more than we in fact do, because that’s easier to do in the face of a pressure for results, easier than standing our ground and saying “we have no idea what we’re talking about.” That the focus on concrete asks and concrete proposals will place far too much emphasis on what we can find under the streetlight, and will end up giving us an inflated sense of our understanding, such that we stop searching in the darkness altogether, forget that it is even there…
I agree with you that having concrete asks would be great, but I think they’re only great if we actually have the right asks. In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems—in the absence of a realistic plan to get those, I think demanding them may end up being actively harmful. Harmful because people will walk away feeling like AI Safety “knows” more than it does and will hence, I think, feel more secure than is warranted.
As I mention in the post, we do have the ability to do concrete capabilities evals right now. What we can’t do are concrete safety evals, which I’m very clear about not expecting us to have right now.
And I’m not expecting that we eventually solve the problem of building good safety evals either—but I am describing a way in which things go well that involves a solution to that problem. If we never solve the problem of understanding-based evals, then my particular sketch doesn’t work as a way to make things go well: but that’s how any story of success has to work right now given that we don’t currently know how to make things go well. And actually telling success stories is an important thing to do!
If you have an alternative success story that doesn’t involve solving safety evals, tell it! But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don’t yet have a solution to—which every success story has to do—seems like an extremely unfair criticism.
This post is not a responsible scaling plan. Your whole comment seems to be weirdly conflating stuff that I’m saying with stuff in the Anthropic RSP. This post is about my thoughts on RSPs in general—which do not necessarily represent Anthropic’s thoughts on anything—and the post isn’t really about Anthropic’s RSP at all.
Regardless, I’m happy to give my take. I don’t think that anybody currently has a convincing story to tell about how to get a good understanding of AI systems, but you can read my thoughts on how we might get to one here.
It sounds like you’re disagreeing with me, but everything you’re saying here is consistent with everything I said. The whole point of my proposal is to understand what evals we can trust and when we can trust them, set up eval-gated scaling in the cases where we can do concrete evals, and be very explicit about the cases where we can’t.
When assumptions are clear, it’s not valuable to criticise the activity of daring to consider what follows from them. When assumptions are an implicit part of the frame, they become part of the claims rather than part of the problem statement, and their criticism becomes useful for all involved, in particular making them visible. Putting burdens on criticism such as needing concrete alternatives makes relevant criticism more difficult to find.
I found this quite hard to parse fyi
Fully agree with almost all of this. Well said.
One nitpick of potentially world-ending importance:
Giving us high confidence is not the bar—we also need to be correct in having that confidence.
In particular, we’d need to be asking: “How likely is it that the process we used to find these measures and evaluations gives us [actually sufficient measures and evaluations] before [insufficient measures and evaluations that we’re confident are sufficient]? How might we tell the difference? What alternative process would make this more likely?...”
I assume you’d roll that into assessing your confidence—but I think it’s important to be explicit about this.
Based on your comment, I’d be interested in your take on:
Put many prominent disclaimers and caveats in the RSP—clearly and explicitly.
vs
Attempt to make commitments sufficient for safety by committing to [process to fill in this gap] - including some high-level catch-all like ”...and taken together, these conditions make training of this system a good idea from a global safety perspective, as evaluated by [external board of sufficiently cautious experts]”.
Not having thought about it for too long, I’m inclined to favor (2).
I’m not at all sure how realistic it is from a unilateral point of view—but I think it’d be useful to present proposals along these lines and see what labs are willing to commit to. If no lab is willing to commit to any criterion they don’t strongly expect to be able to meet ahead of time, that’s useful to know: it amounts to “RSPs are a means to avoid pausing”.
I imagine most labs wouldn’t commit to [we only get to run this training process if Eliezer thinks it’s good for global safety], but I’m not at all sure what they would commit to.
At the least, it strikes me that this is an obvious approach that should be considered—and that a company full of abstract thinkers who’ve concluded “There’s no direct, concrete, ML-based thing we can commit to here, so we’re out of options” don’t appear to be trying tremendously hard.
Who? Science has never worked by means of deferring to a designated authority figure. I agree, of course, that we want people to do things that make the world less rather than more likely to be destroyed. But if you have a case that a given course of action is good or bad, you should expect to be able to argue that case to knowledgeable people who have never heard of this Eliza person, whoever she is.
I remember reading a few good blog posts about this topic by a guest author on Robin Hanson’s blog back in ’aught-seven.
This was just an example of a process I expect labs wouldn’t commit to, not (necessarily!) a suggestion.
The key criterion isn’t even appropriate levels of understanding, but rather appropriate levels of caution—and of sufficient respect for what we don’t know. The criterion [...if aysja thinks it’s good for global safety] may well be about as good as [...if Eliezer thinks it’s good for global safety].
It’s much less about [This person knows], than about [This person knows that no-one knows, and has integrated this knowledge into their decision-making].
Importantly, a cautious person telling an incautious person “you really need to be cautious here” is not going to make the incautious person cautious (perhaps slightly more cautious than their baseline—but it won’t change the way they think).
A few other thoughts:
Scientific intuitions will tend to be towards doing what uncovers information efficiently. If an experiment uncovers some highly significant novel unknown that no-one was expecting, that’s wonderful from a scientific point of view.
This is primarily about risk, not about science. Here the novel unknown that no-one was expecting may not lead to a load of interesting future work, since we all might be dead.
We shouldn’t expect the intuitions or practices of science to robustly point the right way here.
There is no rule that says the world must play fair and ensure that it gives us compelling evidence that a certain path forward will get us killed, before we take the path that gets us killed. The only evidence available may be abstract, indirect and gesture at unknown unknowns.
The situation in ML is unprecedented, in that organizations are building extremely powerful systems that no-one understands. The “experts” [those who understand the systems best] are not experts [those who understand the systems well]. There’s no guarantee that anyone has the understanding to make the necessary case in concrete terms.
If you have a not-fully-concrete case for a certain course of action, experts are divided on that course of action, and huge economic incentives point in the other direction, you shouldn’t be shocked when somewhat knowledgeable people with huge economic incentives follow those economic incentives.
The purpose of committing to follow the outcome of an external process is precisely that it may commit you to actions that you wouldn’t otherwise take. A commitment to consult with x, hear a case from y, etc is essentially empty (if you wouldn’t otherwise seek this information, why should anyone assume you’ll be listening? If you’d seek it without the commitment, what did the commitment change?).
To the extent that decision-makers are likely to be overconfident, a commitment to defer to a less often overconfident system can be helpful. This Dario quote (full context here) doesn’t exactly suggest there’s no danger of overconfidence:
“I mean one way to think about it is like the responsible scaling plan doesn’t slow you down except where it’s absolutely necessary. It only slows you down where it’s like there’s a critical danger in this specific place, with this specific type of model, therefore you need to slow down.”
Earlier there’s:
”...and as we go up the scale we may actually get to the point where you have to very affirmatively show the safety of the model. Where you have to say yes, like you know, I’m able to look inside this model, you know with an x-ray, with interpretability techniques, and say ’yep, I’m sure that this model is not going to engage in this dangerous behaviour because, you know, there isn’t any circuitry for doing this, or there’s this reliable suppression circuitry...”
But this doesn’t address the possibility of being wrong about how early it was necessary to affirmatively show safety.
Nor does it give me much confidence that “affirmatively show the safety of the model” won’t in practice mean something like “show that the model seems safe according to our state-of-the-art interpretability tools”.
Compare that to the confidence I’d have if the commitment were to meet the bar where e.g. Wei Dai agrees that you’ve “affirmatively shown the safety of the model”. (and, again, most of this comes down to Wei Dai being appropriately cautious and cognizant of the limits of our knowledge)
Thanks for writing this up.
I agree that the issue is important, though I’m skeptical of RSPs so far, since we have one example and it seems inadequate—to the extent that I’m positively disposed, it’s almost entirely down to personal encounters with Anthropic/ARC people, not least yourself. I find it hard to reconcile the thoughtfulness/understanding of the individuals with the tone/content of the Anthropic RSP. (of course I may be missing something in some cases)
Going only by the language in the blog post and the policy, I’d conclude that they’re an excuse to continue scaling while being respectably cautious (though not adequately cautious). Granted, I’m not the main target audience—but I worry about the impression the current wording creates.
I hope that RSPs can be beneficial—but I think much more emphasis should be on the need for positive demonstration of safety properties, that this is not currently possible, and that it may take many years for that to change. (mentioned, but not emphasized in the Anthropic policy—and without any “many years” or similar)
It’s hard to summarize my concerns, so apologies if the following ends up somewhat redundant.
I’ll focus on your post first, and the RSP blog/policy doc after that.
There’s an obvious thing to do here. It’s far from obvious that it’s a solution.
One of my main worries with RSPs is that they’ll be both [plausibly adequate as far as governments can tell] and [actually inadequate]. That’s much worse than if they were clearly inadequate.
They kick in when we detect that models have capabilities that we realize are relevant to downstream risks.
Both detection and realization can fail.
My main worry here isn’t that we’ll miss catastrophic capabilities in the near term (though it’s possible). Rather it’s the lack of emphasis on this distinction: that tests will predictably fail to catch problems, and that there’s a decent chance some of them fail before we expect them to.
This could use greater emphasis in the RSP blog/doc.
Yes!
We need governments to make them mandatory before they’re necessary, not once we have them (NB, not [before it’s clear they’re necessary] - it might not be clear). I don’t expect us to have sufficiently accurate understanding-based evals before they’re necessary. (though it’d be lovely)
Pushing to require state-of-the-art safety techniques is the wrong emphasis.
We need to push for adequate safety techniques. If state-of-the-art techniques aren’t yet adequate, then labs need to stop.
Thoughts on the blog/doc themselves. Something of a laundry list, but hopefully makes clear where I’m coming from:
My top-level concern is overconfidence: to the extent that we understand what’s going on, and things are going as expected, I think RSPs similar to Anthropic’s should be pretty good. This gives me very little comfort, since I expect catastrophes to occur when there’s something unexpected that we’ve failed to understand.
Both the blog post and the policy document fail to make this sufficiently clear.
Examples:
From the blog: “On the one hand, the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures. But it does so in a way that directly incentivizes us to solve the necessary safety issues as a way to unlock further scaling...”.
This is not true: the incentive is to satisfy the conditions in the RSP. That’s likely to mean that the lab believes they’ve solved the necessary safety issues. They may not be correct about that.
To the extent that they triple-check even after they think all is well, that’s based on morality/self-preservation. The RSP incentives do not push that way. Incorrectly believing they push that way doesn’t give me confidence.
No consideration of the possibility of jumping from ASL-(n) to ASL-(n+2).
No consideration of a model being ASL-(n+1), but showing no detectable warning signs beyond ASL-n. (there’s a bunch on bumping into ASL-n before expecting to—but not on this going undetected)
I expect the preceding to be unusual; conditional on catastrophe, I expect something unusual has happened.
On evals:
Demanding capabilities may be strongly correlated, so that it doesn’t matter too much if we fail to test for everything important. Alternatively, it could be the case that we do need to cover all the bases, since correlations aren’t as strong as we expect. In that case, [covering all the bases that we happen to think of] may not be sufficient. (this seems unlikely, but possible, to me)
More serious is the possibility that there are methods of capability elicitation/amplification that the red-teamers don’t find. For example, if no red-teamer had thought to try chain-of-thought approaches, capabilities might have been missed. Where is the guarantee that nothing like this is missed?
I don’t see any correlation-based defense here—it seems quite possible that some ways to extract capabilities are just much better than others. What confidence level should we have that testing finds the best ways?
Why isn’t it emphasized that red-teaming can show that something is dangerous, but not that it’s safe? Where’s the discussion around how often we should expect tests to fail to catch important problems? Where’s the discussion around p(model is dangerous | model looks safe to us)? Is this low? Why? When? When will this change? How will we know?...
In general the doc seems to focus on [we’re using the best techniques currently available], and fails to make a case that [the best techniques currently available are sufficient].
E.g. page 16: “Evaluations should be based on the best capabilities elicitation techniques we are aware of at the time”
This worries me because governments/regulators are used to situations where state-of-the-art tests are always adequate (since building x tends to imply understanding x, outside ML). Therefore, I’d want to see this made explicit and clear.
This is the closest I can find, but it’s rather vague:
”Complying with higher ASLs is not just a procedural matter, but may sometimes require research or technical breakthroughs to give affirmative evidence of a model’s safety (which is generally not possible today)...”
It’d be nice if the reader couldn’t assume throughout that the kind of research/breakthrough being talked about is the kind that’s routinely doable within a few months, rather than the kind that may take a decade.
Miscellaneous:
From the policy document, page 2:
”As AI systems continue to scale, they may become capable of increased autonomy that enables them to proliferate and, due to imperfections in current methods for steering such systems, potentially behave in ways contrary to the intent of their designers or users.”
To me ”...imperfections in current methods...” seems misleading—it gives the impression that labs basically know what they’re doing on alignment, but need to add a few tweaks here and there. I don’t believe this is true, and I’d be surprised to learn that many at Anthropic believe this.
Policy doc, page 3:
”Rather than try to define all future ASLs and their safety measures now (which would almost certainly not stand the test of time)...”
This seems misleading since it’s not hard to define ASLs and safety measures which would stand the test of time: the difficult thing is to define measures that stand the test of time, but allow scaling to continue.
There’s an implicit assumption here that the correct course is to allow as much scaling as we can get away with, rather than to define strict measures that would stop things for the foreseeable future—given that we may be overconfident.
I don’t think it’s crazy to believe the iterative approach is best, but I do think it deserves explicit argument. If the argument is “yes, stricter measures would be nice, but aren’t realistic right now”, then please say this (not just here in your post, I mean—somewhere clear to government people).
In particular, I think it’s principled to make clear that a lab would accept more strict conditions if they were universally enforced than those it would unilaterally adopt.
Conversely, I find it worrying for a lab to say “we’re unilaterally doing x, and we think [everyone doing x] is the thing to aim for”, since I expect the x that makes unilateral sense to be inadequate as a global coordination target.
Page 10:
”We will manage our plans and finances to support a pause in model training if one proves necessary”
This seems nice, but gives the impression more of [we might need to pause for six months] than [we might need to pause for ten years]. Given that the latter seems possible, it seems important to acknowledge that radical contingency plans would be necessary for this—and to have such plans (potentially with government assistance, and/or [stuff that hasn’t occurred to me]).
Without that, there’ll be an unhelpful incentive to cut corners or to define inadequate ASLs on the basis that they seem more achievable.
I’m mostly not going to comment on Anthropic’s RSP right now, since I don’t really want this post to become about Anthropic’s RSP in particular. I’m happy to talk in more detail about Anthropic’s RSP maybe in a separate top-level post dedicated to it, but I’d prefer to keep the discussion here focused on RSPs in general.
I definitely share this worry. But that’s part of why I’m writing this post! Because I think it is possible for us to get good RSPs from all the labs and governments, but it’ll take good policy and advocacy work to make that happen.
I agree that this is a serious concern, though I think that at least in the case of capabilities evaluations, it should be solvable. Though it’ll require those capabilities evaluations to actually be done effectively, I think we at least do know how to do effective capabilities evaluations—it’s mostly a solved problem in theory and just requires good implementation.
The distinction between an alignment technique and an alignment evaluation is very important here: I very much am trying to push for adequate safety techniques rather than simply state-of-the-art safety techniques, and the way I’m proposing we do that is via evaluations that check whether we understand our models. What I think probably needs to happen before you can put understanding-based evals in an RSP is not that we have to solve mechanistic interpretability—it’s that we have to solve understanding-based evals. That is, we need to know how to evaluate whether mechanistic interpretability has been solved or not. My concern with trying to put something like that into an RSP right now is that it’ll end up evaluating the wrong thing: since we don’t yet know how to effectively evaluate understanding, any evaluation we set up right now would probably be too game-able to actually be workable here.
This seems an overstatement to me:
Where the main risk is misuse, we’d need to know that those doing the testing have methods for eliciting capabilities that are as effective as anything people will come up with later. (including the most artful AutoGPT 3.0 setups etc)
It seems reasonable to me to claim that “we know how to do effective [capabilities given sota elicitation methods] evaluations”, but that doesn’t answer the right question.
Once the main risk isn’t misuse, then we have to worry about assumptions breaking down (no exploration hacking / no gradient hacking / [assumption we didn’t realize we were relying upon]). Obviously we don’t expect these to break yet, but I’d guess that we’ll be surprised the first time they do break.
I expect your guess on when they will break to be more accurate than mine—but that [I don’t have much of a clue, so I’m advocating extreme caution] may be the more reasonable policy.
We don’t know how to put the concrete eval in the RSP, but we can certainly require that an eval for understanding passes. We can write in the RSP what the test would be intended to achieve, and conditions for the approval of the eval. E.g. [if at least two of David Krueger, Wei Dai and Abram Demski agree that this meets the bar for this category of understanding eval, then it does] (or whatever other criteria you might want).
Again, only putting targets that are well understood concretely in the RSP seems like a predictable way to fail to address poorly understood problems.
Either the RSP needs to cover the poorly understood problems too—perhaps with a [you can’t pass this check without first coming up with a test and getting it approved] condition, or it needs a “THIS RSP IS INADEQUATE TO ENSURE SAFETY” warning in huge red letters on every page. (if the Anthropic RSP communicates this at all, it’s not emphasized nearly enough)
Setting aside the potential advantages of RSPs, this strikes me as a pretty weird thing to say. I understand the term “pause” in this context to mean that you stop building cutting-edge AI models, either voluntarily or due to a government mandate. In contrast, “RSP” says you eventually do that but you gate it on certain model sizes and test results and unpause it under other test results. This strikes me as a bit less nebulous, but only a bit.
I’m not quite sure what’s going on here—it’s possible that the term “pause” has gotten diluted? Seems unfortunate if so.
I think the problem is that nobody really has an idea for what the resumption condition should be for a pause, and nobody’s willing to defend the (actually actionable) six-month pause proposal.
the FLI letter asked for “pause for at least 6 months the training of AI systems more powerful than GPT-4” and i’m very much willing to defend that!
my own worry with RSPs is that they bake in (and legitimise) the assumptions that a) near term (eval-less) scaling poses trivial xrisk, and b) there is a substantial period during which models trigger evals but are existentially safe. you must have thought about them, so i’m curious what you think.
that said, thank you for the post, it’s a very valuable discussion to have! upvoted.
Sure, but I guess I would say that we’re back to nebulous territory then—how much longer than six months? When if ever does the pause end?
I agree that this is mostly baked in, but I think I’m pretty happy to accept it. I’d be very surprised if there was substantial x-risk from the next model generation.
But also I would argue that, if the next generation of models do pose an x-risk, we’ve mostly already lost—we just don’t yet have anything close to the sort of regulatory regime we’d need to deal with that in place. So instead I would argue that we should be planning a bit further ahead than that, and trying to get something actually workable in place further out—which should also be easier to do because of the dynamic where organizations are more willing to sacrifice potential future value than current realized value.
Yeah, I agree that this is tricky. Theoretically, since we can set the eval bar at any capability level, there should exist capability levels that you can eval for and that are safe but scaling beyond them is not. The problem, of course, is whether we can effectively identify the right capabilities levels to evaluate in advance. The fact that different capabilities are highly correlated with each other makes this easier in some ways—lots of different early warning signs will all be correlated—but harder in other ways—the dangerous capabilities will also be correlated, so they could all come at you at once.
Probably the most important intervention here is to keep applying your evals while you’re training your next model generation, so they trigger as soon as possible. As long as there’s some continuity in capabilities, that should get you pretty far. Another thing you can do is put strict limits on how much labs are allowed to scale their next model generation relative to the models that have been definitively evaluated to be safe. And furthermore, my sense is that at least in the current scaling paradigm, the capabilities of the next model generation tend to be relatively predictable given the current model generation.
So overall, my sense is that takeoff only has to be marginally continuous for this to work—if it’s extremely abrupt, more of a classic FOOM scenario, then you might have problems, but I think that’s pretty unlikely.
Thanks! Happy to chat about this more also offline.
i agree that, if hashed out, the end criteria may very well resemble RSPs. still, i would strongly advocate for a scaling moratorium until widely (internationally) acceptable RSPs are put in place.
i share the intuition that the current and next LLM generations are unlikely to be an xrisk. however, i don’t trust my (or anyone else’s) intuitions strongly enough to say that there’s a less than 1% xrisk per 10x scaling of compute. in expectation, that’s killing 80M existing people—people who are unaware that this is happening to them right now.
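For what it’s worth, the arithmetic behind that 80M figure checks out given a world population of roughly 8 billion (the population number is my assumption for illustration, not from the comment):

```python
# Expected deaths implied by a 1% extinction risk per 10x compute scale-up.
# The 8-billion world population is an assumed round number for illustration.
world_population = 8_000_000_000
p_xrisk_per_10x = 0.01  # the "less than 1%" threshold from the comment

expected_deaths = p_xrisk_per_10x * world_population
print(int(expected_deaths))  # prints 80000000, i.e. the 80M figure
```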
Do you think if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, would this reduce the overall x-risk/p(doom) or not? I’m asking because here you seem to assume a defeatist position that only governments are able to shape the actions of the leading AGI labs (which, by the way, are very very few—in my understanding, only 3 or 4 labs have any chance of releasing a “next generation” model for as much as two years from now, others won’t be able to achieve this level of capability even if they tried), but in the post you advocate for the opposite—for voluntary actions taken by the labs, and that regulation can follow.
Probably, but Anthropic is actively working in the opposite direction:
Obviously, Claude 2 as a conversational e-commerce agent is not going to pose catastrophic risk, but it wouldn’t be surprising if building an ecosystem of more powerful AI agents increased the risk that autonomous AI agents cause catastrophic harm.
Is evaluation of capabilities, which as you note requires fine-tuning and other such techniques, a realistic thing to properly do continuously during model training, without that being prohibitively slow or expensive? Would doing this be part of the intended RSP?
Anthropic’s RSP includes evals after every 4x increase in effective compute and after every 3 months, whichever comes sooner, even if this happens during training, and the policy says that these evaluations include fine-tuning.
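That cadence ("whichever comes sooner") can be sketched as a simple trigger check. This is purely illustrative: the function name, signature, and units are mine, not anything from Anthropic’s actual policy or tooling:

```python
def evals_due(effective_compute: float,
              compute_at_last_eval: float,
              months_since_last_eval: float,
              compute_ratio_trigger: float = 4.0,
              time_trigger_months: float = 3.0) -> bool:
    """Return True if either trigger has fired: a 4x increase in effective
    compute since the last evaluation, or 3 months elapsed, whichever
    comes sooner. Hypothetical sketch, not Anthropic's implementation."""
    compute_trigger = effective_compute >= compute_ratio_trigger * compute_at_last_eval
    time_trigger = months_since_last_eval >= time_trigger_months
    return compute_trigger or time_trigger

# A 2x compute increase after one month does not yet trigger evals;
# either hitting 4x compute or reaching 3 months does.
print(evals_due(2.0e24, 1.0e24, 1.0))  # False
print(evals_due(4.0e24, 1.0e24, 1.0))  # True (compute trigger)
print(evals_due(2.0e24, 1.0e24, 3.0))  # True (time trigger)
```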
Do you know why 4x was picked? I understand that doing evals properly is a pretty substantial effort, but once we get up to gigantic sizes and proto-AGIs it seems like it could hide a lot. If there was a model sitting in training with 3x the train-compute of GPT4 I’d be very keen to know what it could do!
maybe “when alignment is solved”
Is the idea that an indefinite pause is unactionable? If so, I’m not sure why you think that.
I talk about that here:
I mean, whether something’s realistic and whether something’s actionable are two different things (both separate from whether something’s nebulous) - even if it’s hard to make a pause happen, I have a decent guess about what I’d want to do to up those odds: protest, write to my congress-person, etc.
As to the realism, I think it’s more realistic than I think you think it is. My impression of AI Impacts’ technological temptation work is that governments are totally willing to enact policies that impoverish their citizens without requiring a rigorous CBA. Getting early wins does seem like an important consideration, but you can imagine trying to get some early wins by e.g. banning AI from being used in certain domains, or banning people from developing advanced AI without doing X, Y, or Z.
Sure—I just think it’d be better to spend that energy advocating for good RSPs instead.
To be clear, the whole point of my post is that I am in favor of pausing/stopping AI development—I just think the best way to do that is via RSPs.
I think AI Safety Levels are a good idea, but evals-based classification needs to be complemented by compute thresholds to mitigate the risks of loss of control via deceptive alignment. Here is a non-nebulous proposal.
On RSPs vs pauses, my basic take is that hardcore pauses are better than RSPs and RSPs are considerably better than weak pauses.
Best: we first prevent hardware progress and stop H100 manufacturing for a bit, then we prevent AI algorithmic progress, and then we stop scaling (ideally in that order). Then, we heavily invest in long run safety research agendas and hold the pause for a long time (20 years sounds good to start). This requires heavy international coordination.
I think good RSPs are worse than this, but probably much better than just having a lab pause scaling.
It’s possible that various actors should explicitly state that hardcore pauses would be better (insofar as they think so).
I propose changing the term for this second type of evaluation to “propensity evaluations”. I think this is a better term as it directly fits the definition you provided: “a model evaluation designed to test under what circumstances a model would actually try to do some task”.
Moreover, I think that both capabilities evaluations and propensity evaluations can be types of safety evaluations. Therefore, it’s misleading to label only one of them as “safety evaluations”. For example, we could construct a compelling safety argument for current models using solely capability evaluations.
Either can be sufficient for safety: a strong argument based on capabilities (we’ve conclusively determined that the AI is too dumb to do anything very dangerous) or a strong argument based on propensity (we have a theoretically robust and empirically validated case that our training process will result in an AI that never attempts to do anything harmful).
Alternatively, a moderately strong argument based on capabilities combined with a moderately strong argument based on propensity can be sufficient, provided that the evidence is sufficiently independent.
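To make the independence point concrete, here is a toy calculation (the 10% residual-risk numbers are invented for illustration): if a catastrophe requires the model to both be capable of harm and try to cause it, and the two arguments fail independently, the residual risks multiply:

```python
# Toy numbers, invented for illustration: each "moderately strong" argument
# leaves a 10% chance of being wrong. Independence is the key (and contestable)
# assumption flagged above: if both arguments failed for the same reason,
# the combined bound would be no better than either one alone.
p_capable_of_harm = 0.10  # residual risk left by the capabilities argument
p_tries_to_harm = 0.10    # residual risk left by the propensity argument

p_catastrophe_bound = p_capable_of_harm * p_tries_to_harm
print(p_catastrophe_bound)  # combined bound of about 1%, far below either 10%
```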
I sometimes refer to capability based arguments as control arguments.
Then, we can name two lines of defense:
The control line of defense: Would the AI succeed at causing bad outcomes if it tried?
The propensity line of defense: Would the AI try to cause bad outcomes?
It’s possible to develop techniques which advance either the control line of defense or the propensity line of defense. Of course, many research directions are useful for both.
I expect that in practice, we’re not very close to being able to make good propensity arguments (for instance, theory and interpretability both seem to me to be unlikely to establish this in the next several doublings of total R&D effort). However, we’re not that far off from building quite powerful control based countermeasures. I think these control based countermeasures could scale to establishing barely acceptable safety arguments (e.g. 1-5% doom per year) for around human level AI systems while using these AI systems to accelerate software based R&D (e.g. alignment research) by >30x.
Further, as Evan noted, control style arguments seem much more straightforward to evaluate (though various complications can result from exploration and gradient hacking). So, I think the feedback loops on control look quite promising.
(Redwood Research, where I work, is currently pursuing several control style projects and we should be putting out various things on this soon.)
An important thing to emphasize with control arguments is that it seems quite unlikely that control arguments can be made workable for very superhuman models. (At least for the notion of “control arguments” which can be readily assessed with non-insane capability evaluations.)
AI might also cause bad outcomes as a side-effect of pursuing other goals. A lot of the bad outcomes that powerful agents like companies produce today are not because the company tries to cause bad outcomes.
Sure. I just mean “try to do things which result in bad outcomes from our perspective”.
If you care about preventing AI from causing bad outcomes, those are not the same thing. It’s important to be able to distinguish them.
[it turns out I have many questions—please consider this a pointer to the kind of information I’d find useful, rather than a request to answer them all!]
Can you point to what makes you think this is likely? (or why it seems the most promising approach)
In particular, I worry when people think much in terms of “doublings of total R&D effort” given that I’d expect AI assistance progress multipliers to vary hugely—with the lowest multipliers correlating strongly with the most important research directions.
To me it seems that the kind of alignment research that’s plausible to speed up 30x is the kind that we can already do without much trouble—narrowly patching various problems in ways we wouldn’t expect to generalize to significantly superhuman systems.
That and generating a ton of empirical evidence quickly—which is nice, but I expect the limiting factor is figuring out what questions to ask.
It doesn’t seem plausible that we get a nice inductive pattern where each set of patches safely allows a little more capability, which in turn allows more patches… I’m not clear on when this would fail, but I’m pretty clear that it would fail.
What we’d seem to need is a large speedup on more potentially-sufficiently-general-if-they-work approaches—e.g. MIRI/ARC-theory/JW stuff.
30x speedup on this seems highly unlikely. (I guess you’d agree?)
Even if it were possible to make a month of progress in one day, it doesn’t seem possible to integrate understanding of that work in a day (if the AI is doing the high-level integration and direction-setting, we seem to be out of the [control measures will keep this safe] regime).
I also note that empirically, theoretical teams don’t tend to add a load of very smart humans. I’m sure that Paul could expand a lot more quickly if he thought that was helpful. Likewise MIRI.
Are they making a serious error here, or are the limitations of very-smart-human assistants not going to apply to AI assistants? (granted, I expect AI assistants aren’t going to have personality clashes etc)
Are you expecting sufficiently general alignment solutions to come out of work that doesn’t require deep integrated understanding? Can you point to current work (or properties of current work) that would be examples? Would you guess the things we could radically speed up are sufficient for a solution, or just useful? If the latter, how much painfully-slow-by-comparison work seems likely to be needed?
Or would the hope be that for more theoretical work there’s a significant speedup, even if it’s not 30x? What seems plausible to you here? 5x? Why is this currently not being achieved through human scaling? Is 5x enough to compensate for the risks? What multiplier would be just sufficient to compensate?
What would you consider early evidence of the expected multiplier for theoretical work?
E.g. should we be getting a 3x speedup with current AIs on open, underspecified problems that seem somewhat easier than alignment? Are we? (on anything—not only alignment-relevant things)
My immediate reaction to this kind of approach is that it feels like wishful thinking without much evidence. However, I’m aware that I do aesthetically prefer theoretically motivated approaches—so I don’t entirely trust my reaction.
I can buy being even more pessimistic about the theoretical approaches than getting lucky with software based R&D—but to me this suggests that coordination around a stop might be the best bet.
I’m not going to respond to everything you’re saying here right now. It’s pretty likely I won’t end up responding to everything you’re saying at any point; so apologies for that.
Here are some key claims I want to make:
Serial speed is key: Speeding up theory work (like e.g. ARC theory) by 5-10x should be quite doable with human level AIs due to AIs running at much faster serial speeds. This is a key difference between adding AIs and adding humans. Theory can be hard to parallelize which makes adding humans look worse than increasing speed. I’m not confident in speeding up theory work by >30x with controlled and around human level AI, but this doesn’t seem impossible.
Access to the human level AIs makes safety work much more straightforward: A key difference between current safety work and future safety work is that in the future we’ll have access to the exact AIs we’re worried about. I expect this opens up a bunch of empirical work which is quite useful and relatively easy to scalably automate with AIs. I think this work could extend considerably beyond “patches”. (The hope here is similar to model organisms, but somewhat more general.)
The research target can be trusted human level systems instead of superhuman systems. One possible story for victory goes something like “control of untrusted AIs → trustworthy human level (or slightly superhuman AIs) → [some next target like fully scalable alignment]”. If human researchers are literally fully obsoleted by reasonably trustworthy human level AIs and these AIs can collectively speed up any field of alignment by >30x, we should be in a radically better position. These trustworthy AI researchers could work on fully scalable alignment, control or alignment of the next generation of smarter AIs, or some combination of both of these. Researching how to make trusted human level AIs seems much more tractable than researching how to align wildly superhuman systems (though both are hard to measure).
Relying on just control results in a very bad absolute level of risk, but it still might be our best option. I estimated 1-5% doom per year above, but my exact guess will vary depending on various factors. So coordination to do better than this would be great.
Here are some other less important claims which feed into my overall takes:
Current AIs aren’t useful for theory yet partially because they’re too dumb. They suck at math.
I think part of the problem with current AIs is getting the AIs enough context, but I think this is a doable technical problem which probably needs to be solved for AIs to be human level and I expect this to be solved prior to AIs taking over.
My understanding is that the current work done by ARC theory results in them trying to solve a bunch of relatively straightforward math problems, and if they could solve all of these problems very quickly, this would massively accelerate their work. I expect this to be roughly true going forward due to my understanding of their methodology, but I’m not very confident here.
AIs have other structural advantages beyond serial speed which will make speeding things up with AIs relatively easier than with humans.
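The "serial speed is key" claim above can be illustrated with an Amdahl's-law style sketch (the 90% serial fraction is an assumed number, chosen to represent hard-to-parallelize theory work, not a measurement):

```python
# Why a faster serial worker beats more parallel workers for theory-like work.
def speedup_from_parallelism(serial_fraction: float, n_workers: int) -> float:
    """Amdahl's law: only the parallelizable part benefits from extra workers."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

def speedup_from_serial_speed(k: float) -> float:
    """A k-times-faster single worker speeds up everything, serial or not."""
    return k

print(speedup_from_parallelism(0.9, 10))  # roughly 1.1x from ten extra workers
print(speedup_from_serial_speed(10.0))    # 10x from one 10x-faster worker
```

This is why adding AIs that run at much faster serial speeds can look very different from adding human researchers: the parallelism route saturates quickly when most of the work is serial, while the serial-speed route does not.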
This is clarifying, thanks.
A few thoughts:
“Serial speed is key”:
This makes sense, but seems to rely on the human spending most of their time tackling well-defined but non-trivial problems where an AI doesn’t need to be re-directed frequently [EDIT: the preceding was poorly worded—I meant that if prior to the availability of AI assistants this were true, it’d allow a lot of speedup as the AIs take over this work; otherwise it’s less clearly so helpful].
Perhaps this is true for ARC—that’s encouraging (though it does again make me wonder why they don’t employ more mathematicians—surely not all the problems are serial on a single critical path?).
I’d guess it’s less often true for MIRI and John.
Of course once there’s a large speedup of certain methods, the most efficient methodology would look different. I agree that 5x to 10x doesn’t seem implausible.
″...in the future we’ll have access to the exact AIs we’re worried about.”:
We’ll have access to the ones we’re worried about deploying.
We won’t have access to the ones we’re worried about training until we’re training them.
I do buy that this makes safety work for that level of AI more straightforward—assuming we’re not already dead. I expect most of the value is in what it tells us about a more general solution, if anything—similarly for model organisms. I suppose it does seem plausible that this is the first level we see a qualitatively different kind of general reasoning/reflection that leads us in new theoretical directions. (though I note that this makes [this is useful to study] correlate strongly with [this is dangerous to train])
“Researching how to make trustworthy human level AIs seems much more tractable than researching how to align wildly superhuman systems”:
This isn’t clear to me. I’d guess that the same fundamental understanding is required for both. “trustworthy” seems superficially easier than “aligned”, but that’s not obvious in a general context.
I’d expect that implementing the trustworthy human-level version would be a lower bar—but that the same understanding would show us what conditions would need to obtain in either case. (certainly I’m all for people looking for an easier path to the human-level version, if this can be done safely—I’d just be somewhat surprised if we find one)
“So coordination to do better than this would be great”.
I’d be curious to know what you’d want to aim for here—both in a mostly ideal world, and what seems most expedient.
As far as the ideal, I happened to write something about in another comment yesterday. Excerpt:
As far as expedient, something like:
Demand labs have good RSPs (or something similar) using inside and outside game, try to get labs to fill in tricky future details of these RSPs as early as possible without depending on “magic” (speculative future science which hasn’t yet been verified). Have AI takeover motivated people work on the underlying tech and implementation.
Work on policy and aim for powerful US policy interventions in parallel. Other countries could also be relevant.
Both of these are unlikely to perfectly succeed, but seems like good directions to push on.
I think pushing for AI lab scaling pauses is probably net negative right now, but I don’t feel very strongly either way (it mostly just feels not that leveraged overall). I think slowing down hardware progress seems clearly good if we could do it at low cost, but seems super intractable.
Thanks, this seems very reasonable. I’d missed your other comment.
(Oh and I edited my previous comment for clarity: I guess you were disagreeing with my clumsily misleading wording, rather than what I meant(??))
Corresponding comment text:
I think I disagree with what you meant, but not that strongly. It’s not that important, so I don’t really want to get into it. Basically, I don’t think that “well-defined” is that important (not obviously required for some ability to judge the finished work) and I don’t think “re-direction frequency” is the right way to think about it.
Resume when the scientific community has a much clearer idea about how to build AGIs that don’t pose a large extinction risk for humanity. This consideration can’t be turned into a benchmark right now, hence the technical necessity for a pause to remain nebulous.
RSPs are great, but not by themselves sufficient. Any impression that they are sufficient bundles irresponsible neglect of the less quantifiable risks with the useful activity of creating benchmarks.
(comment crossposted from EA forum)
Very interesting post! But I’d like to push back. The important things about a pause, as envisaged in the FLI letter, for example, are that (a) it actually happens, and (b) the pause is not lifted until there is affirmative demonstration that the risk is lifted. The FLI pause call was not, in my view, on the basis of any particular capability or risk, but because of the out-of-control race to do larger giant scaling experiments without any reasonable safety assurances. This pause should still happen, and it should not be lifted until there is a way in place to assure that safety. Many of the things FLI hoped could happen during the pause are happening — there is huge activity in the policy space developing standards, governance, and potentially regulations. It’s just that now those efforts are racing the un-paused technology.
In the case of “responsible scaling” (for which I think the ideas of “controlled scaling” or “safety-first scaling” would be better), what I think is very important is that there not be a presumption that the pause will be temporary, and lifted “once” the right mitigations are in place. We may well hit a point (and may be there now), where it is pretty clear that we don’t know how to mitigate the risks of the next generation of systems we are building (and it may not even be possible), and new bigger ones should not be built until we can do so. An individual company pausing “until” it believes things are safe is subject to the exact same competitive pressures that are driving scaling now — both against pausing, and in favor of lifting a pause as quickly as possible. If the limitations on scaling come from the outside, via regulation or oversight, then we should ask for something stronger: before proceeding, show to those outside organizations that scaling is safe. The pause should not be lifted until or unless that is possible. And that’s what the FLI pause letter asks for.
Once labs are trying to pass capability evaluations, they will spend effort trying to suppress the specific capabilities being evaluated*, so I think we’d expect them to stop being so highly correlated.
* If they try methods of more generally suppressing the kinds of capabilities that might be dangerous, I think they’re likely to test them most on the capabilities being evaluated by RSPs.
My guess is that the hard “Pause” advocates are focussed on optimizing for actions that are “necessary” and “sufficient” for safety, but at the expense of perhaps not being politically or practically “feasible”.
Whereas, “Responsible Scaling Policies” advocates may instead describe actions that are “necessary”, and more “feasible” however are less likely to be “sufficient”.
The crux of this disagreement might be related to how feasible, or how sufficient each of these two pathways respectively are?
Absent any known pathways that solve all three, I’m glad people are exploring both of these pathways (and the potential overlap between them). I hope that there is increased exploration.
Perhaps we are going through a temporary phase of increased contention between Pauses versus RSPs as they both may be vying for similar memetic uptake (e.g. on the lesswrong home page right now there is a link for “Global Pause AI Protest” events spread across seven countries happening a few days from now.)
(Conflict of interest: I support implementation of Anthropic’s Responsible Scaling Policy)
From my reading of ARC Evals’ example of a “good RSP”, RSPs set a standard that roughly looks like: “we will continue scaling models and deploying if and only if our internal evals team fails to empirically elicit dangerous capabilities. If they do elicit dangerous capabilities, we will enact safety controls just sufficient for our models to be unsuccessful at, e.g., creating Super Ebola.”
This is better than a standard of “we will scale and deploy models whenever we want,” but still has important limitations. As noted by the “coordinated pausing” paper, it would be problematic if “frontier AI developers and other stakeholders (e.g. regulators) rely too much on evaluations and coordinated pausing as their main intervention to reduce catastrophic risks from AI.”
Some limitations:
Misaligned incentives. The evaluation team may have an incentive to find fewer dangerous capabilities than possible. When findings of dangerous capabilities could lead to timeline delays, public criticism, and lost revenue for the company, an internal evaluation team has a conflict of interest. Even with external evaluation teams, AI labs may choose whichever one is most favorable or inexperienced (e.g., choosing an inexperienced consulting team).
Underestimating risk. Pre-deployment evaluations underestimate the potential risk after deployment. A small evaluation team, which may be understaffed, is unlikely to exhaust all the ways to enhance a model’s capabilities for dangerous purposes, compared to what the broader AI community could do after a model is deployed to the public. The most detailed evaluation report to date, ARC Evals’ evaluation report on realistic autonomous tasks, notes that it does not bound the model’s capabilities at these tasks.
For example, suppose that an internal evaluations team has to assess dangerous capabilities before the lab deploys a next-generation AI model. With only one month to assess the final model, they find that even with fine-tuning and available AI plugins, the AI model reliably fails to replicate itself, and conclude that there is minimal risk of autonomous replication. The AI lab releases the model and with the hype from the new model, AI deployment becomes a more streamlined process, new tools are built for AIs to navigate the internet, and comprehensive fine-tuning datasets are commissioned to train AIs to make money for themselves with ease. The AI is now able to easily autonomously replicate, even where the past generation still fails to do so. The goalposts shift so that AI labs are worried only about autonomous replication if the AI can also hack its weights and self-exfiltrate.
Not necessarily compelling. Evaluations finding dangerous capabilities may not be perceived as a compelling reason to pause or enact stronger safety standards across the industry. Experiments by an evaluation team do not reflect real-world conditions and may be dismissed as unrealistic. Some capabilities such as autonomous replication may be seen as overly abstract and detached from evidence of real-world harm, especially for politicians that care about concrete concerns from constituents. Advancing dangerous capabilities may even be desirable for some stakeholders, such as the military, and reason to race ahead in AI development.
Safety controls may be minimal. The safety controls enacted in response to dangerous-capability evaluations could be relatively minimal and brittle, falling apart in real-world usage, as long as they meet the standards of the responsible scaling policy.
There are other factors that can motivate policymakers to adopt strong safety standards, besides empirical evaluations of extreme risk. Rather than requiring safety only when AIs demonstrate extreme risk (e.g., killing millions with a pandemic), governments are already considering preventing them from engaging in illegal activities. China recently passed legislation to prevent generative AI services from generating illegal content, and the EU AI Act has a similar proposal in Article 28b. While these provisions are focused on generative AI rather than AI agents, it seems feasible to set a standard for AIs to be generally law-abiding (even after jailbreaking or fine-tuning attempts), which would also help reduce their potential contribution to catastrophic risk. Setting liability for AI harms, as proposed by Senators Blumenthal and Hawley, would also motivate AI labs to be more cautious. We’ve seen lobbying from OpenAI and Google to change the EU AI Act to shift away the burden of making AIs safe to downstream applications (see response letter from the AI Now Institute, signed by several researchers at GovAI). Lab-friendly policy like RSPs may predictably underinvest in measures that regulate current and near-future models.
I think it’s pretty clear that’s at least not what I’m advocating for—I have a very specific story of how I think RSPs go well in my post.
These seem like good interventions to me! I’m certainly not advocating for “RSPs are all we need”.
The problem with a naive implementation of RSPs is that we’re trying to build a safety case for a disaster that we fundamentally don’t understand and where we haven’t even produced a single disaster example or simulation.
To be more specific, we don’t know exactly which bundles of AI capabilities and deployments will eventually result in a negative outcome for humans. Worse, we’re not even trying to answer that question—nobody has run an “end of the world simulator” and as far as I am aware there are no plans to do that.
Without such a model it’s very difficult to do expected utility maximization with respect to AGI scaling, deployment, etc.
Safety is a global property, not a local property. We have some surface-level understanding of this from events like The Arab Spring or World War I. Was Europe in 1913 “safe”? Apparently not, but it wasn’t obvious to people.
What will happen if and when someone makes AI systems that are emotionally compelling to people and demand sentient rights for AIs? How do you run a safety eval for that? What are the consequences for humanity if we let AI systems vote in elections, run for office, start companies or run mainstream news orgs and popular social media accounts? What is the endgame of that world, and does it include any humans?
Can you clarify whether this is implying that open-source capability benchmark thresholds will be at the same or similar levels to closed-source ones? That is how I initially read it, but I’m not sure that’s the intended meaning.
More thoughts that are only semi-relevant if I misunderstood below.
---
If I’m understanding the assumption correctly, the idea that the capability benchmark thresholds would be the same for open-source and closed-source LLMs surprised me[1], given (a) the irreversibility of open-source proliferation and (b) the lack of effective guardrails against misuse of open-source LLMs.
Perhaps the implicit argument is that labs should assume their models will be leaked when doing risk evaluations unless they have insanely good infosec, so they should effectively treat their models as open-source. Anthropic does say in their RSP:
This makes some sense to me, but looking at the definition of ASL-3 as if the model is effectively open-sourced:
I understand that limiting to only 1% of the model costs and only existing post-training techniques makes it more tractable to measure the risk, but it strikes me as far from a conservative bound if we are assuming the model might be stolen or leaked. It might make sense to forecast how much the model would improve with more effort put into post-training, and/or with more years going by allowing improved post-training enhancements.
Perhaps there should be a difference between accounting for model theft by a particular actor and completely open-sourcing, but then we’re back to the question of why the open-source capability benchmarks should be the same as the closed-source ones.
This is not to take a stance on the effect of open-sourcing LLMs at current capabilities levels, but rather being surprised that the capability threshold for when open-source is too dangerous would be the same as closed-source.
No, you’d want the benchmarks to be different for open-source vs. closed-source, since there are some risks (e.g. bio-related misuse) that are much scarier for open-source models. I tried to mention this here:
I’ll edit the post to be more clear on this point.
If the model is smart enough, you die before writing the evals report; if it’s just kinda smart, the evals don’t find it to be too intelligent, and you die after launching your scalable oversight system that, as a whole, is smarter than the individual models.
An international moratorium on all training runs that could stumble on something that might kill everyone is much more robust than regulations keyed to the evaluated capabilities of already-trained models.
Edit: Huh. I would’ve considered the above to be relatively uncontroversial on LessWrong. Can someone explain where I’m wrong?