FWIW I read Anthropic’s RSP and came away with the sense that they would stop scaling if their evals suggested that a model being trained either registered as ASL-3 or was likely to (if they scaled it further). They would then restart scaling once they 1) had a definition of the ASL-4 model standard and lab standard and 2) met the standard of an ASL-3 lab.
I got the impression that Anthropic wants to do the following things before it scales beyond systems that trigger their ASL-3 evals:
Have good enough infosec so that it is “unlikely” for non-state actors to steal model weights, and state actors can only steal them “with significant expense.”
Be ready to deploy evals at least once every 4x increase in effective compute.
Have a blog post that tells the world what they plan to do to align ASL-4 systems.
The security commitment is the most concrete, and I agree with Habryka that these don’t seem likely to cause Anthropic to stop scaling:
Like, I agree that some of these commitments are costly, but I don’t see how there is any world where Anthropic would like to continue scaling but finds itself incapable of doing so, which is what I would consider a “pause” to mean. Like, they can just implement their checklist of security requirements and then go ahead.
Maybe this is quibbling over semantics, but it really does feel quite qualitatively different to me. When OpenAI said that they would spend some substantial fraction of their compute on “Alignment Research” while they train their next model, I think it would be misleading to say “OpenAI has committed to conditionally pausing model scaling”.
The commitment to define ASL-4 and tell us how they plan to align it does not seem like a concrete commitment. A concrete commitment would look something like “we have solved X open problem in alignment, as verified via Y verification method” or “we have the ability to pass X test with Y% accuracy.”
As is, the commitment is very loose. Anthropic could just publish a post saying “ASL-4 systems are systems that can replicate autonomously in the wild, perform at a human level on most cognitive tasks, or substantially boost AI progress. To align them, we will use Constitutional AI 2.0. And we are going to make our information security even better.”
To be clear, the RSP is consistent with a world in which Anthropic actually chooses to pause before scaling to ASL-4 systems. Like, maybe they will want their containment measures for ASL-4 to be really, really good, which will require a major pause. But the RSP does not commit Anthropic to having any particular containment measures or any particular evidence that it is safe to scale to ASL-4; it only commits Anthropic to publishing a post about ASL-4 systems. This is why I don’t consider the ASL-4 section to be a concrete commitment.
The same thing holds for the evals point: Anthropic could say “we feel like our evals are good enough,” or they could say “ah, we actually need to pause for a long time to get better evals.” The RSP is consistent with either of these worlds, and Anthropic has enough flexibility here that I don’t think it makes sense to call this a concrete commitment.
Note, though, that the prediction market on Anthropic’s security commitments currently gives Anthropic a 35% chance of pausing for at least one month, which has updated me somewhat in the direction of “maybe the security commitment is more concrete than I thought”. That said, I still think it’s a bad idea to train a model capable of making biological weapons if state actors can steal it “with significant expense.” The commitment would be more concrete if it said something like “state actors would not be able to steal this model unless they spent at least $X, which we will operationally define as passing Y red-teaming effort by Z independent group.”
I got the impression that Anthropic wants to do the following things before it scales beyond ASL-3:
Did you mean ASL-2 here? This seems like a pretty important detail to get right. (What they would need to do to scale beyond ASL-3 is meet the standard of an ASL-4 lab, which they have not developed yet.)
I agree with Habryka that these don’t seem likely to cause Anthropic to stop scaling:
By design, RSPs are conditional pauses; you pause until you have met the standard, and then you continue. If you get the standard in place soon enough, you don’t need to pause at all. This incentivizes implementing the security and safety procedures as soon as possible, which seems good to me.
But the RSP does not commit Anthropic to having any particular containment measures or any particular evidence that it is safe to scale to ASL-4; it only commits Anthropic to publishing a post about ASL-4 systems. This is why I don’t consider the ASL-4 section to be a concrete commitment.
Yes, I agree that the ASL-4 part is an IOU, and I predict that when they eventually publish it there will be controversy over whether or not they got it right. (Ideally, by then we’ll have a consensus framework and independent body that develops those standards, which Anthropic will just sign on to.)
Again, this is by design; the underlying belief of the RSP is that we can only see so far ahead through the fog, and so we should set our guidelines bit by bit, rather than pausing until we can see our way all the way to an aligned sovereign.
My understanding is that their commitment is to stop once their ASL-3 evals are triggered. They hope that their ASL-3 evals will be conservative enough to trigger before they actually have an ASL-3 system, but I think that’s an open question. I’ve edited my comment to say “before Anthropic scales beyond systems that trigger their ASL-3 evals”. See this section from their RSP below:
“We commit to define ASL-4 evaluations before we first train ASL-3 models (i.e., before continuing training beyond when ASL-3 evaluations are triggered).”
By design, RSPs are conditional pauses; you pause until you have met the standard, and then you continue.
Yup, this makes sense. I don’t think we disagree on the definition of a conditional pause. But if a company says “we will do X before we keep scaling”, and X is a relatively easy standard to meet, I think it’s misleading to say “the company has specified concrete commitments under which they would pause.” Even if technically accurate, that gives an overly rosy picture of what happened, and I would expect it to systematically mislead readers into thinking the commitments were stronger than they are.
For the Anthropic RSP in particular, I think it’s accurate & helpful to say “Anthropic has said that they will not scale past systems that substantially increase misuse risk [if they are able to identify this] until they have better infosec and until they have released a blog post defining ASL-4 systems and telling the world how they plan to develop those safely.”
Then, separately, readers can decide for themselves how “concrete” or “good” these commitments are. In my opinion, they are not particularly concrete, and I was expecting much more given how people were initially communicating about RSPs.
the underlying belief of the RSP is that we can only see so far ahead through the fog, and so we should set our guidelines bit by bit, rather than pausing until we can see our way all the way to an aligned sovereign.
This feels a bit separate from the above discussion, and “wait until we can see all the way to an aligned sovereign” is not an accurate characterization of my view, but here’s how I would frame it.
My underlying problem with the RSP framework is that it presumes that companies should be allowed to keep scaling until there is clear and imminent danger, at which point we do [some unspecified thing]. I think a reasonable response from RSP defenders is something like “yes, but we also want stronger regulation and we see this as a step in the right direction.” And then the crux becomes something like “OK, on balance, what effect will RSPs have on government regulations [perhaps relative to nothing, or perhaps relative to what would’ve happened if the energy that went into RSPs had gone into advocating for something else]?”
I currently have significant concerns that if the RSP framework, as it has currently been described, is used as the basis for regulation, it will lock in an incorrect burden of proof. In other words, governments might endorse something like “you can keep scaling until auditors can show clear signs of danger and prove that your safeguards are insufficient.” This is the opposite of what we expect in other high-risk sectors.
That said, it’s not impossible that RSPs will actually get us closer to better regulation; I do buy some sort of general “if industry does something, it’s easier for governments to implement it” logic. But I want to see RSP advocates engage more with the burden of proof concerns.
To make this more concrete: I would be enthusiastic if ARC Evals released a blog post saying something along the lines of: “we believe the burden of proof should be on Frontier AI developers to show us affirmative evidence of safety. We have been working on dangerous capability evaluations, which we think will be a useful part of regulatory frameworks, but we would strongly support regulations that demand more evidence than merely the absence of dangerous capabilities. Here are some examples of what that would look like...”
My understanding is that their commitment is to stop once their ASL-3 evals are triggered.
Ok, we agree. By “beyond ASL-3” I thought you meant “stuff that’s outside the category ASL-3” instead of “the first thing inside the category ASL-3”.
For the Anthropic RSP in particular, I think it’s accurate & helpful to say
Yep, that summary seems right to me. (I also think the “concrete commitments” statement is accurate.)
But I want to see RSP advocates engage more with the burden of proof concerns.
Yeah, I also think putting the burden of proof on scaling (instead of on pausing) is safer and probably appropriate. I am hesitant about it on process grounds; it seems to me like evidence of safety might require the scaling that we’re not allowing until we see evidence of safety. On net, it seems like the right decision on the current margin, but the same lock-in concern (if we do the right thing now for the wrong reasons, perhaps we will do the wrong thing for the same reasons in the future) worries me about simply switching the burden of proof instead of coming up with a better system to evaluate risk.
FWIW I read Anthropic’s RSP and came away with the sense that they would stop scaling if their evals suggested that a model being trained either registered as ASL-3 or was likely to (if they scaled it further). They would then restart scaling once they 1) had a definition of the ASL-4 model standard and lab standard and 2) met the standard of an ASL-3 lab.
Do you not think that? Why not?