Thanks for writing this, Evan! I think it’s the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.
I plan to write up more opinions about RSPs, but one I’ll express for now is that I’m pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I’ll detail this below:
What would a good RSP look like?
Clear commitments along the lines of “we promise to run these 5 specific tests to evaluate these 10 specific dangerous capabilities.”
Clear commitments regarding what happens if the evals go off (e.g., “if a model scores above a 20 on the Hubinger Deception Screener, we will stop scaling until it has scored below a 10 on the relatively conservative Smith Deception Test.”) (A toy sketch of what such a trigger rule could look like appears just after this list.)
Clear commitments regarding the safeguards that will be used once evals go off (e.g., “if a model scores above a 20 on the Cotra Situational Awareness Screener, we will use XYZ methods and we believe they will be successful for ABC reasons.”)
Clear evidence that these evals will exist, will likely work, and will be conservative enough to prevent catastrophe
Some way of handling race dynamics (such that Bad Guy can’t just be like “haha, cute that you guys are doing RSPs. We’re either not going to engage with your silly RSPs at all, or we’re gonna publish our own RSP but it’s gonna be super watered down and vague”).
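To make the flavor of “clear commitments” concrete, here is a minimal sketch of what a machine-checkable trigger rule could look like. Everything in it is hypothetical: the eval names are the made-up examples from the list above, the thresholds are placeholders, and no lab has committed to anything like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingTrigger:
    screener: str        # eval that gates further scaling (hypothetical name)
    pause_above: float   # screener score at which scaling must stop
    resume_eval: str     # more conservative eval required before resuming
    resume_below: float  # score on resume_eval needed to resume scaling


# Placeholder commitment mirroring the deception example above.
TRIGGERS = [
    ScalingTrigger("Hubinger Deception Screener", 20, "Smith Deception Test", 10),
]


def may_continue_scaling(scores: dict[str, float]) -> bool:
    """Scaling is allowed only if every trigger is satisfied.
    Missing scores are treated as failures (a conservative default)."""
    for t in TRIGGERS:
        screener_score = scores.get(t.screener)
        if screener_score is None or screener_score > t.pause_above:
            resume_score = scores.get(t.resume_eval)
            if resume_score is None or resume_score >= t.resume_below:
                return False
    return True
```

The point isn’t this particular encoding; it’s that a commitment this precise is auditable, whereas “we’ll figure out appropriate thresholds later” is not.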
What do RSPs actually look like right now?
Fairly vague commitments, more along the lines of “we will improve our information security and we promise to have good safety techniques. But we don’t really know what those look like.”
Unclear commitments regarding what happens if evals go off (let alone what evals will even be developed and what they’ll look like). Very much a “trust us; we promise we will be safe. For misuse, we’ll figure out some way of making sure there are no jailbreaks, even though we haven’t been able to do that before.”
Also, for accident risks/AI takeover risks… well, we’re going to call those “ASL-4 systems”. Our current plan for ASL-4 is “we don’t really know what to do… please trust us to figure it out later. Maybe we’ll figure it out in time, maybe not. But in the meantime, please let us keep scaling.”
Extremely high uncertainty about what safeguards will be sufficient. The plan essentially seems to be “as we get closer to highly dangerous systems, we will hopefully figure something out.”
No strong evidence that these evals will exist in time or work well. The science of evaluations is extremely young, and the current evals are more like “let’s play around and see what things can do” than “we have solid tests and some consensus around how to interpret them.”
No way of handling race dynamics absent government intervention. In fact, companies are allowed to break their voluntary commitments if they’re afraid that they’re going to lose the race to a less safety-conscious competitor. (This is explicitly endorsed in ARC’s post and Anthropic includes such a clause.)
Important note: I think several of these limitations are inherent to the current gameboard. Like, I’m not saying “I think it’s a bad move for Anthropic to admit that they’ll have to break their RSP if some Bad Actor is about to cause a catastrophe.” That seems like the right call. I’m also not saying that dangerous capability evals are bad—I think it’s a good bet for some people to be developing them.
Why I’m disappointed with current comms around RSPs
Instead, my central disappointment comes from how RSPs are being communicated. It seems to me like the main three RSP posts (ARC’s, Anthropic’s, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs. I don’t expect policymakers who engage with the public comms to walk away with an appreciation for the limitations of RSPs, their current level of vagueness + “we’ll figure things out later”ness, etc.
On top of that, the posts seem to have this “don’t listen to the people who are pushing for stronger asks like moratoriums—instead please let us keep scaling and trust industry to find the pragmatic middle ground” vibe. To me, this seems not only counterproductive but also unnecessarily adversarial. I would be more sympathetic to the RSP approach if it was like “well yes, we totally think it’d be great to have a moratorium or a global compute cap or a kill switch or a federal agency monitoring risks or a licensing regime, and we also think this RSP thing might be kinda nice in the meantime.” Instead, ARC implies that the moratorium folks are unrealistic, and tries to say they operate on an extreme end of the spectrum, on the opposite side of those who believe it’s too soon to worry about catastrophes whatsoever.
(There’s also an underlying thing here where I’m like “the odds of achieving a moratorium, or a licensing regime, or hardware monitoring, or an agency that monitors risks and has emergency powers—the odds of meaningful policy getting implemented are not independent of our actions. The more that groups like Anthropic and ARC claim ‘oh that’s not realistic’, the less realistic those proposals are.” I think people are also wildly underestimating the degree to which Overton Windows can change and the amount of uncertainty there currently is among policymakers, but this is a post for another day, perhaps.)
I’ll conclude by noting that some people have gone so far as to say that RSPs are intentionally trying to dilute the policy conversation. I’m not yet convinced this is the case, and I really hope it’s not. But I’d really like to see more coming out of ARC, Anthropic, and other RSP supporters to earn the trust of people who are (IMO reasonably) suspicious when scaling labs come out and say “hey, you know what the policy response should be? Let us keep scaling, and trust us to figure it out over time, but we’ll brand it as this nice catchy thing called Responsible Scaling.”
Strongly agree with almost all of this.
My main disagreement is that I don’t think the “What would a good RSP look like?” description is sufficient without explicit conditions beyond evals. In particular, I think we should expect that our suite of tests will be insufficient at some point, absent hugely improved understanding—and that we shouldn’t expect to understand how and why it’s insufficient before reality punches us in the face.
Therefore, it’s not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended].
We also need strong evidence that there’ll be no catastrophe-inducing problems we didn’t think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none)
This can’t be implicit, since it’s a central way that we die.
If it’s hard/impractical to estimate, then we should pause until we can estimate it more accurately
This is the kind of thing that I expect to be omitted from RSPs as a matter of course, precisely because we lack the understanding to create good models/estimates, legible tests etc. That doesn’t make it ok. Blank map is not blank territory.
If we’re thinking of better mechanisms to achieve a pause, I’d add:
Call it something like a “Responsible training and deployment policy (RTDP)”, not an RSP. Scaling is the thing in question. We should remove it from the title if we want to give the impression that it might not happen. (compare “Responsible farming policy”, “Responsible fishing policy”, “Responsible diving policy”—all strongly imply that responsible x-ing is possible, and that x-ing will continue to happen subject to various constraints)
Don’t look for a ‘practical’ solution. A serious pause/stop will obviously be impractical (yet not impossible). To restrict ourselves to practical approaches is to give up on any meaningful pause. Doing the impractical is not going to get easier later.
Define ASLs or similar now rather than waiting until we’re much closer to achieving them. Waiting to define them later gives the strong impression that the approach is [pick the strongest ASL definitions and measures that will be achievable so that we can keep scaling] and not [pick ASL definitions and measures that are clearly sufficient for safety].
Evan’s own “We need to make sure that, once we have solid understanding-based evals, governments make them mandatory” only reinforces this impression. Whether we have them is irrelevant to the question of whether they’re necessary.
Be clear and explicit about the potential for very long pauses, and the conditions that would lead to them. Where it’s hard to give precise conditions, give high-level conditions and very conservative concrete defaults (not [reasonably conservative]; [unreasonably conservative]). Have a policy where a compelling, externally reviewed argument is necessary before any conservative default can be relaxed. (A toy sketch of this mechanism follows this list.)
I think there’s a big danger in safety people getting something in place that we think/hope will imply a later pause, only to find that when it really counts the labs decide not to interpret things that way and to press forward anyway—with government/regulator backing, since they’re doing everything practical, everything reasonable.... Assuming this won’t happen seems dangerously naive.
If labs are going to re-interpret the goalposts and continue running into the minefield, we need to know this as soon as possible. This requires explicit clarity over what is being asked / suggested / eventually-entailed. The Anthropic RSP fails at this IMO: no understanding-based requirements; no explicit mention that pausing for years may be necessary. The ARC Evals RSP description similarly fails—if RSPs are intended to be a path to pausing. “Practical middle ground” amounts to never realistically pausing. They entirely overlook overconfidence as a problem. (frankly, I find this confusing coming from Beth/Paul et al)
Have separate RTDPs for unilateral adoption, and adoption subject to multi-lab agreement / international agreement etc. (I expect at least three levels would make sense)
This is a natural way to communicate “We’d ideally like [very strict measures], though [less strict measures] are all we can commit to unilaterally”.
If a lab’s unilateral RTDP looks identical to their [conditional on international agreement] RTDP, then they have screwed up.
Strongly consider pushing for safety leads to write and sign the RTDP (with help, obviously). I don’t want the people who know most about safety to be “involved in the drafting process”; I want to know that they oversaw the process and stand by the final version.
I’m sure there are other sensible additions, but that’d be a decent start.
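To gesture at what the “unreasonably conservative defaults, relaxable only after external review” idea could look like in practice, here is a rough sketch. It is purely illustrative: the names, structure, and example cap are mine, not drawn from any existing RSP.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExternalReview:
    reviewers: tuple[str, ...]  # independent reviewers, external to the lab
    rationale: str              # the written argument for relaxing the default
    approved: bool


@dataclass
class ConservativeDefault:
    name: str
    threshold: float  # deliberately over-conservative starting value
    review_log: list[ExternalReview] = field(default_factory=list)

    def relax(self, new_threshold: float, review: ExternalReview) -> None:
        """A default may only be relaxed with an approved, documented external review."""
        if not (review.approved and review.reviewers and review.rationale.strip()):
            raise PermissionError(
                f"{self.name}: relaxation requires an approved external review"
            )
        self.review_log.append(review)
        self.threshold = new_threshold


# Example: a made-up cap on effective-compute growth per year, starting far below
# anything a lab would like, and only movable with external sign-off.
compute_cap = ConservativeDefault("max effective-compute growth per year", threshold=2.0)
```

The load-bearing property is that movement in the permissive direction is gated on a documented external argument, rather than on the lab’s own judgement at the moment it most wants to keep going.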
Therefore, it’s not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended]. We also need strong evidence that there’ll be no catastrophe-inducing problems we didn’t think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none)
This can’t be implicit, since it’s a central way that we die.
If it’s hard/impractical to estimate, then we should pause until we can estimate it more accurately
This is the kind of thing that I expect to be omitted from RSPs as a matter of course, precisely because we lack the understanding to create good models/estimates, legible tests etc. That doesn’t make it ok. Blank map is not blank territory.
Yeah, I agree—that’s why I’m specifically optimistic about understanding-based evals, since I think they actually have the potential to force us to catch unknown unknowns here, the idea being that they require you to prove that you actually understand your model to a level where you’d know if there were anything wrong that your other evals might miss.
Define ASLs or similar now rather than waiting until we’re much closer to achieving them. Waiting to define them later gives the strong impression that the approach is [pick the strongest ASL definitions and measures that will be achievable so that we can keep scaling] and not [pick ASL definitions and measures that are clearly sufficient for safety].
Evan’s own “We need to make sure that, once we have solid understanding-based evals, governments make them mandatory” only reinforces this impression. Whether we have them is irrelevant to the question of whether they’re necessary.
See the bottom of this comment: my main objection here is that if we were to try to define it now, we’d end up defining something easily game-able because we don’t yet have metrics for understanding that aren’t easily game-able. So if we want something that will actually be robust, we have to wait until we know what that something might be—and ideally be very explicit that we don’t yet know what we could put there.
I think there’s a big danger in safety people getting something in place that we think/hope will imply a later pause, only to find that when it really counts the labs decide not to interpret things that way and to press forward anyway—with government/regulator backing, since they’re doing everything practical, everything reasonable.…
Assuming this won’t happen seems dangerously naive.
I definitely agree that this is a serious concern! That’s part of why I’m writing this post: I want more public scrutiny and pressure on RSPs and their implementation to try to prevent this sort of thing.
Have separate RTDPs for unilateral adoption, and adoption subject to multi-lab agreement / international agreement etc. (I expect at least three levels would make sense)
IANAL, but I think that this is currently impossible due to anti-trust regulations. The White House would need to enact a safe harbor policy for anti-trust considerations in the context of AI safety to make this possible.
IANAL, but I think that this is currently impossible due to anti-trust regulations.
I don’t know anything about anti-trust enforcement, but it seems to me that this might be a case where labs should do it anyways & delay hypothetical anti-trust enforcement by fighting in court.
I happen to think that the Anthropic RSP is fine for what it is, but it just doesn’t actually make any interesting claims yet. The key thing is that they’re committing to actually having ASL-4 criteria and a safety argument in the future. From my perspective, the Anthropic RSP effectively is an outline for the sort of thing an RSP could be (run evals, have safety buffer, assume continuity, etc) as well as a commitment to finish the key parts of the RSP later. This seems ok to me.
I would have preferred if they had included tentative proposals for ASL-4 evaluations and what their current best safety plan/argument for ASL-4 looks like (using just current science, no magic). Then, they could explain that this plan wouldn’t be sufficient for reasonable amounts of safety (insofar as this is what they think).
Right now, they just have a bulleted list for ASL-4 countermeasures, but this is the main interesting thing to me. (I’m not really sold on substantial risk from systems which aren’t capable of carrying out that harm mostly autonomously, so I don’t think ASL-3 is actually important except as setup.)
Clear commitments along the lines of “we promise to run these 5 specific tests to evaluate these 10 specific dangerous capabilities.”
Clear commitments regarding what happens if the evals go off (e.g., “if a model scores above a 20 on the Hubinger Deception Screener, we will stop scaling until it has scored below a 10 on the relatively conservative Smith Deception Test.”)
Clear commitments regarding the safeguards that will be used once evals go off (e.g., “if a model scores above a 20 on the Cotra Situational Awareness Screener, we will use XYZ methods and we believe they will be successful for ABC reasons.”)
Clear evidence that these evals will exist, will likely work, and will be conservative enough to prevent catastrophe
Some way of handling race dynamics (such that Bad Guy can’t just be like “haha, cute that you guys are doing RSPs. We’re either not going to engage with your silly RSPs at all, or we’re gonna publish our own RSP but it’s gonna be super watered down and vague”).
Yeah, of course this would be nice. But the reason that ARC and Anthropic didn’t write this ‘good RSP’ isn’t that they’re reckless, but because writing such an RSP is a hard open problem. It would be great to have “specific tests” for various dangerous capabilities, or “Some way of handling race dynamics,” but nobody knows what those are.
Of course the specific object-level commitments Anthropic has made so far are insufficient. (Fortunately, they committed to make more specific object-level commitments before reaching ASL-3, and ASL-3 is reasonably well-specified [edit: and almost certainly below x-catastrophe-level].) I praise Anthropic’s RSP and disagree with your vibe because I don’t think you or I or anyone else could write much better commitments. (If you have specific commitments-labs-should-make in mind, please share them!)
(Insofar as you’re just worried about comms and what-people-think-about-RSPs rather than how-good-RSPs-are, I’m agnostic.)
Thanks, Zach! Responses below:
But the reason that ARC and Anthropic didn’t write this ‘good RSP’ isn’t that they’re reckless, but because writing such an RSP is a hard open problem
I agree that writing a good RSP is a hard open problem. I don’t blame ARC for not having solved the “how can we scale safely” problem. I am disappointed in ARC for communicating about this poorly (in their public blog post, and [speculative/rumor-based] maybe in their private government advocacy as well).
I’m mostly worried about the comms/advocacy/policy implications. I would have been much more sympathetic if Anthropic and ARC had come out and said “look, we have some ideas, but the field really isn’t mature enough and we really don’t know what we’re doing, and these voluntary commitments are clearly insufficient. If you really had to ask us for our best guesses re: what to do if there is no government regulation coming, and for some reason we had to keep racing toward god-like AI, here are our best guesses. But please note that this is woefully insufficient and we would strongly prefer government intervention to buy enough time so that we can have actual plans.”
I also expect most of the (positive or negative) impact of the recent RSP posts to come from the comms/advocacy/policy externalities.
I don’t think you or I or anyone else could write much better commitments
I don’t think the question of whether you or I could write better commitments is very relevant. My claim is more like “no one can make a good enough RSP right now, so instead of selling governments on RSPs, we should be communicating clearly that the current race to godlike AI is awful, our AIS ideas are primitive, we might need to pause for decades, and we should start developing the hardware monitoring//risk assessment//emergency powers//kill switch infrastructure//international institutions that we will need.”
But if I had to answer this directly: I actually do think that if I spent 1-2 weeks working on coming up with better commitments, and I could ask for feedback from like 1-3 advisors of my choice, I could probably come up with “better” commitments. I don’t think this is because I’m particularly smart, though; I just think the bar is low. My impression is that the 22-page doc from Anthropic didn’t actually have many commitments.
The main commitments that stood out to me were: (a) run evals [exact evals unspecified] at least every 4X in effective compute, (b) have good infosec before you have models that can enable bioweapons or other Scary Misuse models, and (c) define ASL-4 criteria once you have Scary Misuse models. There are some other more standard/minor things as well like sharing vulnerabilities with other labs, tiered model access, and patching jailbreaks [how? and how much is sufficient?].
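(As an aside on (a), here is roughly what a “4X in effective compute” cadence cashes out to. This is my own toy arithmetic, not Anthropic’s wording, and “effective compute” means whatever accounting the lab adopts.)

```python
import math


def evals_due(effective_compute: float, last_evaluated_at: float, factor: float = 4.0) -> bool:
    """True once effective compute has grown by at least `factor` since the last eval run."""
    return effective_compute >= factor * last_evaluated_at


# A 1000x effective-compute scale-up at a 4x cadence implies eval runs at
# 4x, 16x, 64x, and 256x -- i.e. floor(log_4(1000)) = 4 rounds before 1000x.
rounds = math.floor(math.log(1000) / math.log(4))
assert rounds == 4
```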
The main caveat I’ll add is that “better” is a fuzzy term in this context. Like, I’m guessing a lot of the commitments I’d come up with are things that are more costly from Anthropic’s POV. So maybe many of them would be “worse” in the sense that Anthropic wouldn’t be willing to adopt them, or would argue that other labs are not going to adopt them, and therefore that they can’t adopt them without becoming less likely to win the race.
I would love to see competing RSPs (or, better yet, RTDPs, as @Joe_Collman pointed out in a cousin comment).
It seems to me like the main three RSP posts (ARC’s, Anthropic’s, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs.
I mean, I am very explicitly trying to communicate what I see as the success story here. I agree that there are many ways that this could fail—I mention a bunch of them in the last section—but I think that having a clear story of how things could go well is important to being able to work to actually achieve that story.
On top of that, the posts seem to have this “don’t listen to the people who are pushing for stronger asks like moratoriums—instead please let us keep scaling and trust industry to find the pragmatic middle ground” vibe.
I want to be very clear that I’ve been really happy to see all the people pushing for strong asks here. I think it’s a really valuable thing to be doing, and what I’m trying to do here is not stop that but help it focus on more concrete asks.
I would be more sympathetic to the RSP approach if it was like “well yes, we totally think it’d be great to have a moratorium or a global compute cap or a kill switch or a federal agency monitoring risks or a licensing regime, and we also think this RSP thing might be kinda nice in the meantime.”
To be clear, I definitely agree with this. My position is not “RSPs are all we need”, “pauses are bad”, “pause advocacy is bad”, etc.—my position is that getting good RSPs is an effective way to implement a pause: i.e. “RSPs are pauses done right.”
To be clear, I definitely agree with this. My position is not “RSPs are all we need”, “pauses are bad”, “pause advocacy is bad”, etc.—my position is that getting good RSPs is an effective way to implement a pause: i.e. “RSPs are pauses done right.”
Some feedback on this: my expectation upon seeing your title was that you would argue, or that you implicitly believe, that RSPs are better than other current “pause” attempts/policies/ideas. I think this expectation came from the common usage of the phrase “done right” to mean that other people are doing it wrong or at least doing it suboptimally.
I mean, to be clear, I am saying something like “RSPs are the most effective way to implement a pause that I know of.” The thing I’m not saying is just that “RSPs are the only policy thing we should be doing.”
This reads as some sort of confused motte and bailey. Are RSPs “an effective way” or “the most effective way… [you] know of”? These are different things, with each being stronger/weaker in different ways. Regardless, the title could still be made much more accurate to your beliefs, e.g. ~’RSPs are our (current) best bet on a pause’. ‘An effective way’ is definitely not “i.e … done right”, but “the most effective way… that I know of” is also not.
‘An effective way’ is definitely not “i.e … done right”, but “the most effective way… that I know of” is also not.
I disagree? I think the plain English meaning of the title “RSPs are pauses done right” is precisely “RSPs are the right way to do pauses (that I know of)” which is exactly what I think and exactly what I am defending here. I honestly have no idea what else that title would mean.
Sorry, yeah, I could have explained what I meant further. The way I see it:
‘X is the most effective way that I know of’ = X tops your ranking of the different ways, but could still be below a minimum threshold (e.g. X doesn’t have to even properly work, it could just be less ineffective than all the rest). So one could imagine someone saying “X is the most effective of all the options I found and it still doesn’t actually do the job!”
‘X is an effective way’ = ‘X works, and it works above a certain threshold’.
‘X is Y done right’ = ‘X works and is basically the only acceptable way to do Y,’ where it’s ambiguous or contextual as to whether ‘acceptable’ means that it at least works, that it’s effective, or sth like ‘it’s so clearly the best way that anyone doing the 2nd best thing is doing something bad’.
Why, then, is “RSPs are the most effective way to implement a pause that I know of” not literally the title of your post?
Are you thinking about this post? I don’t see any explicit claims that the moratorium folks are extreme. What passage are you thinking about?
In terms of explicit claims:
“So one extreme side of the spectrum is build things as fast as possible, release things as much as possible, maximize technological progress [...].
The other extreme position, which I also have some sympathy for, despite it being the absolutely opposite position, is you know, Oh my god this stuff is really scary.
The most extreme version of it was, you know, we should just pause, we should just stop, we should just stop building the technology for, indefinitely, or for some specified period of time. [...] And you know, that extreme position doesn’t make much sense to me either.”
Dario Amodei, Anthropic CEO, explaining his company’s “Responsible Scaling Policy” on the Logan Bartlett Podcast on Oct 6, 2023.
Starts at around 49:40.
This example is not a claim by ARC, though. It seems important to keep track of this in a discussion of what ARC did or didn’t claim, even if others making such claims is also relevant.
I was thinking about this passage:
RSPs offer a potential middle ground between (a) those who think AI could be extremely dangerous and seek things like moratoriums on AI development, and (b) those who think that it’s too early to worry about capabilities with catastrophic potential. RSPs are pragmatic and threat-model driven: rather than arguing over the likelihood of future dangers, we can...
I think “extreme” was subjective and imprecise wording on my part, and I appreciate you catching this. I’ve edited the sentence to say “Instead, ARC implies that the moratorium folks are unrealistic, and tries to say they operate on an extreme end of the spectrum, on the opposite side of those who believe it’s too soon to worry about catastrophes whatsoever.”
This is a really important thing to iron out.
Going forward (through the 2020s), it’s really important not to underestimate how much money will go into subverting or thwarting an AI pause relative to the money going into facilitating one. The impression I get is that the vast majority of people are underestimating how much money and talent will end up being allocated toward subverting or thwarting an AI pause: e.g. finding galaxy-brained ways to intimidate or mislead well-intentioned AI safety orgs into self-sabotage (e.g. opposing policies that are actually feasible, or even mandatory for human survival, like an AI pause), or turning them against each other (which is unambiguously the kind of thing that happens in a world with very high lawyers-per-capita, particularly in issues and industries where lots of money is at stake). False alarms are almost an equally serious issue, because false alarms also severely increase vulnerability, which further incentivises adverse actions against the AI safety community by outside third parties (e.g. by signalling high payoff and low risk of detection for any adverse actions).