I got the impression that Anthropic wants to do the following things before it scales beyond ASL-3:
Did you mean ASL-2 here? This seems like a pretty important detail to get right. (What they would need to do to scale beyond ASL-3 is meet the standard of an ASL-4 lab, which they have not developed yet.)
I agree with Habryka that these don’t seem likely to cause Anthropic to stop scaling:
By design, RSPs are conditional pauses; you pause until you have met the standard, and then you continue. If you get the standard in place soon enough, you don’t need to pause at all. This incentivizes implementing the security and safety procedures as soon as possible, which seems good to me.
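(To make the shape of that logic explicit, here's a minimal sketch in Python; the evaluation, threshold, and standard-check here are hypothetical illustrations of the conditional-pause idea, not Anthropic's actual evals or procedures.)

```python
# Hypothetical sketch of the conditional-pause logic described above.
# The evaluation, threshold, and standard here are made up for illustration;
# they are not Anthropic's actual ASL evals or safety standards.

def evals_triggered(capability_score: float, threshold: float = 0.8) -> bool:
    """Stand-in for a dangerous-capability evaluation (e.g., an ASL-3 trigger)."""
    return capability_score >= threshold

def next_level_standard_met() -> bool:
    """Stand-in for having the next level's security/safety measures in place."""
    return False  # flip to True once the standard is actually implemented

def may_continue_scaling(capability_score: float) -> bool:
    # Pause only while the evals have triggered and the standard is unmet;
    # if the standard is in place before the evals trigger, no pause happens at all.
    return not evals_triggered(capability_score) or next_level_standard_met()

if __name__ == "__main__":
    print("continue scaling" if may_continue_scaling(0.85) else "pause scaling")
```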
But the RSP does not commit Anthropic to having any particular containment measures or any particular evidence that it is safe to scale to ASL-4; it only commits Anthropic to publish a post about ASL-4 systems. This is why I don’t consider the ASL-4 section to be a concrete commitment.
Yes, I agree that the ASL-4 part is an IOU, and I predict that when they eventually publish it there will be controversy over whether or not they got it right. (Ideally, by then we’ll have a consensus framework and independent body that develops those standards, which Anthropic will just sign on to.)
Again, this is by design; the underlying belief of the RSP is that we can only see so far ahead through the fog, and so we should set our guidelines bit-by-bit, rather than pausing until we can see our way all the way to an aligned sovereign.
My understanding is that their commitment is to stop once their ASL-3 evals are triggered. They hope that their ASL-3 evals will be conservative enough to trigger before they actually have an ASL-3 system, but I think that’s an open question. I’ve edited my comment to say “before Anthropic scales beyond systems that trigger their ASL-3 evals”. See this section from their RSP below (bolding my own):
“We commit to define ASL-4 evaluations before we first train ASL-3 models (i.e., before continuing training beyond when ASL-3 evaluations are triggered).”
By design, RSPs are conditional pauses; you pause until you have met the standard, and then you continue.
Yup, this makes sense. I don’t think we disagree on the definition of a conditional pause. But if a company says “we will do X before we keep scaling”, and X is a relatively easy standard to meet, I think it’s misleading to say “the company has specified concrete commitments under which they would pause.” Even if technically accurate, it gives an overly-rosy picture of what happened, and I would expect it to systematically mislead readers into thinking that the commitments were stronger.
For the Anthropic RSP in particular, I think it’s accurate & helpful to say “Anthropic has said that they will not scale past systems that substantially increase misuse risk [if they are able to identify this] until they have better infosec and until they have released a blog post defining ASL-4 systems and telling the world how they plan to develop those safely.”
Then, separately, readers can decide for themselves how “concrete” or “good” these commitments are. In my opinion, these are not particularly concrete, and I was expecting much more when I heard the initial way that people were communicating about RSPs.
the underlying belief of the RSP is that we can only see so far ahead through the fog, and so we should set our guidelines bit-by-bit, rather than pausing until we can see our way all the way to an aligned sovereign.
This feels a bit separate from the above discussion, and the “wait until we can see all the way to an aligned sovereign” is not an accurate characterization of my view, but here’s how I would frame this.
My underlying problem with the RSP framework is that it presumes that companies should be allowed to keep scaling until there is clear and imminent danger, at which point we do [some unspecified thing]. I think a reasonable response from RSP defenders is something like “yes, but we also want stronger regulation and we see this as a step in the right direction.” And then the crux becomes something like “OK, on balance, what effect will RSPs have on government regulations [perhaps relative to nothing, or perhaps relative to what would’ve happened if the energy that went into RSPs had gone into advocating for something else]?”
I currently have significant concerns that if the RSP framework, as it has currently been described, is used as the basis for regulation, it will lock in an incorrect burden of proof. In other words, governments might endorse some sort of “you can keep scaling until auditors can show clear signs of danger and prove that your safeguards are insufficient” standard. This is the opposite of what we expect in other high-risk sectors.
That said, it’s not impossible that RSPs will actually get us closer to better regulation; I do buy some sort of general “if industry does something, it’s easier for governments to implement it” logic. But I want to see RSP advocates engage more with the burden of proof concerns.
To make this more concrete: I would be enthusiastic if ARC Evals released a blog post saying something along the lines of: “we believe the burden of proof should be on Frontier AI developers to show us affirmative evidence of safety. We have been working on dangerous capability evaluations, which we think will be a useful part of regulatory frameworks, but we would strongly support regulations that demand more evidence than merely the absence of dangerous capabilities. Here are some examples of what that would look like...”
My understanding is that their commitment is to stop once their ASL-3 evals are triggered.
Ok, we agree. By “beyond ASL-3” I thought you meant “stuff that’s outside the category ASL-3” instead of “the first thing inside the category ASL-3”.
For the Anthropic RSP in particular, I think it’s accurate & helpful to say
Yep, that summary seems right to me. (I also think the “concrete commitments” statement is accurate.)
But I want to see RSP advocates engage more with the burden of proof concerns.
Yeah, I also think putting the burden of proof on scaling (instead of on pausing) is safer and probably appropriate. I am hesitant about it on process grounds; it seems to me like evidence of safety might require the scaling that we’re not allowing until we see evidence of safety. On net, it seems like the right decision on the current margin, but the same lock-in concern worries me about simply switching the burden of proof (instead of coming up with a better system to evaluate risk): if we do the right thing now for the wrong reasons, perhaps we will do the wrong thing for the same reasons in the future.