This seems like a solid list. Scaling certainly seems core to the RSP concept.
IMO “red lines, iterative policy updates, and evaluations & accountability” are sort of pointing at the same thing. Roughly something like “we promise not to cross X red line until we have published Y new policy and allowed the public to scrutinize it for Z amount of time.”
One interesting thing here is that none of the current RSPs meet this standard. I suppose the closest is Anthropic’s, where they say they won’t scale to ASL-4 until they publish a new RSP (this would cover “red line” but I don’t believe they commit to giving the public a chance to scrutinize it, so it would only partially meet “iterative policy updates” and wouldn’t meet “evaluations and accountability.”)
They will not suddenly cross any of their red lines before the mitigations are implemented/a new RSP version has been published and given scrutiny, by pointing at specific evaluations procedures and policies
This seems like the meat of an ideal RSP. I don’t think it’s done by any of the existing voluntary scaling commitments. All of them have this flavor of “our company leadership will determine when the mitigations are sufficient, and we do not commit to telling you what our reasoning is.” OpenAI’s PF probably comes the closest, IMO (e.g., leadership will evaluate if the mitigations have moved the model from the stuff described in the “critical risk” category to the stuff described in the “high risk” category.)
As long as the voluntary scaling commitments end up having this flavor of “leadership will make a judgment call based on its best reasoning”, it feels like the commitments lack most of the “teeth” of the kind of RSP you describe.
(So back to the original point– I think we could say that something is only an RSP if it has the “we won’t cross this red line until we give you a new policy and let you scrutinize it and also tell you how we’re going to reason about when our mitigations are sufficient” property, but then none of the existing commitments would qualify as RSPs. If we loosen the definition, then I think we just go back to “these are voluntary commitments that have to do with scaling & how the lab is thinking about risks from scaling.”)
This seems like a solid list. Scaling certainly seems core to the RSP concept.
IMO “red lines, iterative policy updates, and evaluations & accountability” are sort of pointing at the same thing. Roughly something like “we promise not to cross X red line until we have published Y new policy and allowed the public to scrutinize it for Z amount of time.”
One interesting thing here is that none of the current RSPs meet this standard. I suppose the closest is Anthropic’s, where they say they won’t scale to ASL-4 until they publish a new RSP (this would cover “red line” but I don’t believe they commit to giving the public a chance to scrutinize it, so it would only partially meet “iterative policy updates” and wouldn’t meet “evaluations and accountability.”)
This seems like the meat of an ideal RSP. I don’t think it’s done by any of the existing voluntary scaling commitments. All of them have this flavor of “our company leadership will determine when the mitigations are sufficient, and we do not commit to telling you what our reasoning is.” OpenAI’s PF probably comes the closest, IMO (e.g., leadership will evaluate if the mitigations have moved the model from the stuff described in the “critical risk” category to the stuff described in the “high risk” category.)
As long as the voluntary scaling commitments end up having this flavor of “leadership will make a judgment call based on its best reasoning”, it feels like the commitments lack most of the “teeth” of the kind of RSP you describe.
(So back to the original point– I think we could say that something is only an RSP if it has the “we won’t cross this red line until we give you a new policy and let you scrutinize it and also tell you how we’re going to reason about when our mitigations are sufficient” property, but then none of the existing commitments would qualify as RSPs. If we loosen the definition, then I think we just go back to “these are voluntary commitments that have to do with scaling & how the lab is thinking about risks from scaling.”)