a pretty specific framework with unique strengths I wouldn’t want overlooked.
What are some of the unique strengths of the framework that you think might get overlooked if we go with something more like “voluntary safety commitments” or “voluntary scaling commitments”?
(Ex: It seems plausible to me that you want to keep the word “scaling” in, since there are lots of safety commitments that could have nothing to do with future models, and “scaling” sort of forces you to think about what’s going to happen as models get more powerful.)
Things that distinguish an “RSP” or “RSP-type commitment” for me (though as with most concepts, something could lack a few of the items below and still overall seem like an “RSP” or a “draft aiming to eventually be an RSP”):
Scaling commitments: Commitments are not just about deployment but about creating/containing/scaling a model in the first place (in fact, for most RSPs I think the scaling/containment commitments are the focus, much more so than the deployment commitments, which are usually a bit vague and to-be-determined).
Red lines: The commitment spells out red lines in advance beyond which even creating a model would be unsafe given the developer’s current practices/security/alignment research, likely including some red lines where the developer admits they would no longer know how to ensure safety, and commits to not reaching that point until the situation changes.
Iterative policy updates: For red lines where the developer doesn’t know what exact mitigations would be sufficient to ensure safety, they identify an earlier point where they do think they can mitigate risks, and commit to not scaling further than that until they have published a new commitment and given the public a chance to scrutinize it.
Evaluations and accountability: This is an area I think many RSPs do poorly at. The developer should present clear, externally accountable evidence that:
They will not suddenly cross any of their red lines before the mitigations are implemented/a new RSP version has been published and given scrutiny, by pointing at specific evaluation procedures and policies.
Their mitigations/procedures/evaluations etc. will be implemented faithfully and in the spirit of the document, e.g. through audits/external oversight/whistleblowing.
Voluntary commitments that wouldn’t be RSPs:
We commit to various deployment mitigations, incident sharing, etc. (these are not about scaling).
We commit to [amazing set of safety practices, including state-proof security and massive spending on alignment etc.] by 2026 (this would be great, but it doesn’t identify red lines for when those mitigations would no longer be enough, and doesn’t make any commitment about what the developer would do if they hit concerning capabilities before 2026).
… and many more, obviously; I think RSPs are actually quite specific.
This seems like a solid list. Scaling certainly seems core to the RSP concept.
IMO “red lines, iterative policy updates, and evaluations & accountability” are sort of pointing at the same thing. Roughly something like “we promise not to cross X red line until we have published Y new policy and allowed the public to scrutinize it for Z amount of time.”
One interesting thing here is that none of the current RSPs meet this standard. I suppose the closest is Anthropic’s, where they say they won’t scale to ASL-4 until they publish a new RSP (this would cover “red lines”, but I don’t believe they commit to giving the public a chance to scrutinize it, so it would only partially meet “iterative policy updates” and wouldn’t meet “evaluations and accountability”).
They will not suddenly cross any of their red lines before the mitigations are implemented/a new RSP version has been published and given scrutiny, by pointing at specific evaluation procedures and policies.
This seems like the meat of an ideal RSP. I don’t think it’s done by any of the existing voluntary scaling commitments. All of them have this flavor of “our company leadership will determine when the mitigations are sufficient, and we do not commit to telling you what our reasoning is.” OpenAI’s PF probably comes the closest, IMO (e.g., leadership will evaluate whether the mitigations have moved the model from the “critical risk” category down to the “high risk” category).
As long as the voluntary scaling commitments end up having this flavor of “leadership will make a judgment call based on its best reasoning”, it feels like the commitments lack most of the “teeth” of the kind of RSP you describe.
(So back to the original point: I think we could say that something is only an RSP if it has the “we won’t cross this red line until we give you a new policy, let you scrutinize it, and tell you how we’re going to reason about when our mitigations are sufficient” property, but then none of the existing commitments would qualify as RSPs. If we loosen the definition, then I think we just go back to “these are voluntary commitments that have to do with scaling and how the lab is thinking about risks from scaling.”)
Yeah, I think you’re kind of right about why scaling seems like a relevant term here. I really like that RSPs are explicit about different tiers of models posing different tiers of risks. I think larger models are just likely to be more dangerous, and dangerous in new and different ways, than the models we have today, and that the safety mitigations applied to them will need to be correspondingly more rigorous. As an example, this framework naturally captures the distinction between “open-sourcing is great today” and “open-sourcing might be very dangerous tomorrow,” which is roughly something I believe.
But in the end, I don’t actually care what the name is, I just care that there is a specific name for this relatively specific framework to distinguish it from all the other possibilities in the space of voluntary policies. That includes newer and better policies — i.e. even if you are skeptical of the value of RSPs, I think you should be in favor of a specific name for it so you can distinguish it from other, future voluntary safety policies that you are more supportive of.
I do dislike that “responsible” might come off as implying that these policies are sufficient, or that scaling is now safe. I could see “risk-informed” having the same issue, which is why “iterated/tiered scaling policy” seems a bit better to me.
even if you are skeptical of the value of RSPs, I think you should be in favor of a specific name for it so you can distinguish it from other, future voluntary safety policies that you are more supportive of
This is a great point – consider me convinced. Interestingly, it’s hard for me to precisely define the things that make something an RSP as opposed to a different type of safety commitment, but there are some patterns in the existing RSP/PF/FSF that do seem to put them in a broader family. (Ex: strong focus on model evaluations, an implicit assumption that AI development should continue until/unless evidence of danger is found, and an implicit assumption that company executives will decide when safeguards are sufficient.)