Henry from SaferAI claims that the new RSP is weaker and vaguer than the old RSP. Do others have thoughts on this claim? (I haven’t had time to evaluate yet.)
Main Issue: Shift from precise definitions to vague descriptions. The primary issue lies in Anthropic’s shift away from precisely defined capability thresholds and mitigation measures. The new policy adopts more qualitative descriptions, specifying the capability levels they aim to detect and the objectives of mitigations, but it lacks concrete details on the mitigations and evaluations themselves. This shift significantly reduces transparency and accountability, essentially asking us to accept a “trust us to handle it appropriately” approach rather than providing verifiable commitments and metrics.
More from him:
Example: Changes in capability thresholds. To illustrate, consider one such threshold:
1️⃣ Version 1 (V1): AI Safety Level 3 (ASL-3) was defined as “The model shows early signs of autonomous self-replication ability, as defined by a 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations].”
2️⃣ Version 2 (V2): ASL-3 is now defined as “The ability to either fully automate the work of an entry-level remote-only researcher at Anthropic, or cause dramatic acceleration in the rate of effective scaling” (quantified as an increase of approximately 1000x in a year).
In V2, the thresholds are no longer defined by quantitative benchmarks. Anthropic now states that they will demonstrate that the model’s capabilities are below these thresholds when necessary. However, this approach is susceptible to shifting goalposts as capabilities advance.
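To make the contrast concrete, here is a minimal sketch of the kind of mechanical check the V1 definition supports (task names and scores are invented for illustration; the real task suite lived in the old RSP’s autonomy-evaluations appendix). The V2 wording admits no analogous computation: whether a model can “fully automate the work of an entry-level remote-only researcher” is a judgment call rather than a pass/fail metric.

```python
# Hypothetical illustration of the old (V1) ASL-3 autonomy trigger:
# a fixed task suite with an explicit 50% aggregate-success-rate cutoff.
# Task names and scores below are invented, not Anthropic's actual evals.

task_success_rates = {  # per-task success rate over repeated trials
    "autonomy_task_1": 0.60,
    "autonomy_task_2": 0.30,
    "autonomy_task_3": 0.55,
    "autonomy_task_4": 0.40,
}

ASL3_CUTOFF = 0.50  # the old RSP's quantitative threshold

aggregate = sum(task_success_rates.values()) / len(task_success_rates)
print(f"Aggregate success rate: {aggregate:.0%}")                     # 46%
print(f"Crosses ASL-3 autonomy trigger: {aggregate >= ASL3_CUTOFF}")  # False
```

The point is not that the old check was a good proxy for risk, but that it was externally verifiable: anyone with the task list could reproduce the yes/no answer.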
🔄 Commitment Changes: Dilution of mitigation strategies. A similar trend is evident in their mitigation strategies. Instead of detailing specific measures, they focus on mitigation objectives, stating they will prove these objectives are met when required. This change alters the nature of their commitments.
💡 Key Point: Committing to robust measures and then diluting them significantly is not how genuine commitments are upheld. The general direction of these changes is concerning. By allowing more leeway to decide if a model meets thresholds, Anthropic risks prioritizing scaling over safety, especially as competitive pressures intensify.
I was expecting the RSP to become more specific as technology advances and their risk management process matures, not the other way around.
My quick take:

Weaker:
- AI R&D threshold: yes, the threshold is much higher
- CBRN threshold: not really, except maybe the new threshold excludes risk from moderately sophisticated nonstate actors
- ASL-3 deployment standard: yes; the changes don’t feel huge, but the new standard doesn’t feel very meaningful
- ASL-3 security standard: no (but both old and new are quite vague)
Vaguer: yes. But the old RSP didn’t really have “precisely defined capability thresholds and mitigation measures.” (The autonomous-replication (ARA) threshold did have that 50% definition, but another part of the RSP suggested those tasks were merely illustrative.)
(New RSP, old RSP.)