Things that distinguish an “RSP” or “RSP-type commitment” for me (though as with most concepts, something could lack a few of the items below and still overall seem like an “RSP” or a “draft aiming to eventually be an RSP”)
Scaling commitments: Commitments are not just about deployment but about creating/containing/scaling a model in the first place (in fact for most RSPs I think the scaling/containment commitments are the focus, much more so than the deployment commitments which are usually a bit vague and to-be-determined)
Red lines: The commitment spells out red lines in advance, where even creating a model past one of those lines would be unsafe given the developer's current practices/security/alignment research. This likely includes some red lines where the developer admits they do not know how to ensure safety anymore, and commits to not reaching that point until the situation changes.
Iterative policy updates: For red lines where the developer doesn't know what exact mitigations would be sufficient to ensure safety, they identify an earlier point where they do think they can mitigate risks, and commit to not scaling further than that until they have published a new commitment and given the public a chance to scrutinize it.
Evaluations and accountability: I think this is an area where many RSPs do poorly; ideally, the developer presents clear, externally accountable evidence that:
They will not suddenly cross any of their red lines before the mitigations are implemented/a new RSP version has been published and given scrutiny, demonstrated by pointing at specific evaluation procedures and policies
Their mitigations/procedures/evaluations etc. will be implemented faithfully and in the spirit of the document, e.g. through audits, external oversight, whistleblower protections, etc.
Voluntary commitments that wouldn’t be RSPs:
We commit to various deployment mitigations, incident sharing etc. (these are not about scaling)
We commit to [amazing set of safety practices, including state-proof security and massive spending on alignment etc.] by 2026 (this would be amazing, but doesn’t identify red lines for when those mitigations would no longer be enough, and doesn’t make any commitment about what they would do if they hit concerning capabilities before 2026)
… many more obviously, I think RSPs are actually quite specific.
This seems like a solid list. Scaling certainly seems core to the RSP concept.
IMO “red lines, iterative policy updates, and evaluations & accountability” are sort of pointing at the same thing. Roughly something like “we promise not to cross X red line until we have published Y new policy and allowed the public to scrutinize it for Z amount of time.”
One interesting thing here is that none of the current RSPs meet this standard. I suppose the closest is Anthropic’s, where they say they won’t scale to ASL-4 until they publish a new RSP (this would cover “red line” but I don’t believe they commit to giving the public a chance to scrutinize it, so it would only partially meet “iterative policy updates” and wouldn’t meet “evaluations and accountability.”)
They will not suddenly cross any of their red lines before the mitigations are implemented/a new RSP version has been published and given scrutiny, demonstrated by pointing at specific evaluation procedures and policies
This seems like the meat of an ideal RSP. I don't think it's done by any of the existing voluntary scaling commitments. All of them have this flavor of "our company leadership will determine when the mitigations are sufficient, and we do not commit to telling you what our reasoning is." OpenAI's PF probably comes the closest, IMO (e.g., leadership will evaluate whether the mitigations have moved the model from the "critical risk" category down to the "high risk" category).
As long as the voluntary scaling commitments end up having this flavor of “leadership will make a judgment call based on its best reasoning”, it feels like the commitments lack most of the “teeth” of the kind of RSP you describe.
(So back to the original point: I think we could say that something is only an RSP if it has the "we won't cross this red line until we give you a new policy and let you scrutinize it, and also tell you how we're going to reason about when our mitigations are sufficient" property, but then none of the existing commitments would qualify as RSPs. If we loosen the definition, then I think we just go back to "these are voluntary commitments that have to do with scaling & how the lab is thinking about risks from scaling.")