In the previous RSP, I had the sense that Anthropic was attempting to draw red lines—points at which, if models passed certain evaluations, Anthropic committed to pause and develop new safeguards. That is, if evaluations triggered, then they would implement safety measures. The “if” was already sketchy in the first RSP, as Anthropic was allowed to “determine whether the evaluation was overly conservative,” i.e., they were allowed to retroactively declare red lines green. Indeed, with such caveats it was difficult for me to see the RSP as much more than a declared intent to act responsibly, rather than a commitment. But the updated RSP seems to be far worse, even, than that: the “if” is no longer dependent on the outcomes of pre-specified evaluations, but on the personal judgment of Dario Amodei and Jared Kaplan.
Indeed, such red lines are now made more implicit and ambiguous. There are no longer predefined evaluations—instead employees design and run them on the fly, and compile the resulting evidence into a Capability Report, which is sent to the CEO for review. A CEO who, to state the obvious, is hugely incentivized to decide to deploy models, since refraining from doing so might jeopardize the company.
This seems strictly worse to me. Some room for flexibility is warranted, but this strikes me as almost maximally flexible, in that practically nothing is predefined—not evaluations, nor safeguards, nor responses to evaluations. This update makes the RSP more subjective, qualitative, and ambiguous. And if Anthropic is going to make the RSP weaker, I wish this had been presented more as an apology, or at least accompanied by a promise to rectify it in the future. Especially because after a year, Anthropic presumably has more information about the risk than before. Why, then, is even more flexibility needed now? What would cause Anthropic to make clear commitments?
I also find it unsettling that the ASL-3 risk threshold has been substantially changed, and the reasoning for this is not explained. In the first RSP, a model was categorized as ASL-3 if it was capable of various precursors for autonomous replication. Now, this has been downgraded to a “checkpoint,” a point at which they promise to evaluate the situation more thoroughly, but don’t commit to taking any particular actions:
We replaced our previous autonomous replication and adaption (ARA) threshold with a “checkpoint” for autonomous AI capabilities. Rather than triggering higher safety standards automatically, reaching this checkpoint will prompt additional evaluation of the model’s capabilities and accelerate our preparation of stronger safeguards.
This strikes me as a big change. The ability to self-replicate is already concerning, but the ability to perform AI R&D seems potentially catastrophic, risking loss of control or extinction. Why does Anthropic now think this shouldn’t count as ASL-3? Why have they replaced this criterion with a substantially riskier one instead?
Dario estimates the probability of something going “really quite catastrophically wrong, on the scale of human civilization” as between 10% and 25%. He also thinks this might happen soon—perhaps between 2025 and 2027. It seems obvious to me that a policy this ambiguous, this dependent on figuring things out on the fly, this beset with such egregious conflicts of interest, is a radically insufficient means of managing risk from a technology which poses so grave and imminent a threat to our world.
Indeed, such red lines are now made more implicit and ambiguous. There are no longer predefined evaluations—instead employees design and run them on the fly, and compile the resulting evidence into a Capability Report, which is sent to the CEO for review. A CEO who, to state the obvious, is hugely incentivized to decide to deploy models, since refraining from doing so might jeopardize the company.
This doesn’t seem right to me, though it’s possible that I’m misreading either the old or new policy (or both).
Re: predefined evaluations, the old policy neither specified any evaluations in full detail, nor did it suggest that Anthropic would have designed the evaluations prior to a training run. (Though I’m not sure that’s what you meant when you contrasted it with “employees design and run them on the fly” as a description of the new policy.)
Re: CEO’s decisionmaking, my understanding of the new policy is that the CEO (and RSO) will be responsible only for approving or denying an evaluation report making an affirmative case that a new model does not cross a relevant capability threshold (“3.3 Capability Decision”, original formatting removed, all new bolding is mine):
If, after the comprehensive testing, we determine that the model is sufficiently below the relevant Capability Thresholds, then we will continue to apply the ASL-2 Standard. The process for making such a determination is as follows:
First, we will compile a Capability Report that documents the findings from the comprehensive assessment, makes an affirmative case for why the Capability Threshold is sufficiently far away, and advances recommendations on deployment decisions.
The report will be escalated to the CEO and the Responsible Scaling Officer, who will (1) make the ultimate determination as to whether we have sufficiently established that we are unlikely to reach the Capability Threshold and (2) decide any deployment-related issues.
In general, as noted in Sections 7.1.4 and 7.2.2, we will solicit both internal and external expert feedback on the report as well as the CEO and RSO’s conclusions to inform future refinements to our methodology. For high-stakes issues, however, the CEO and RSO will likely solicit internal and external feedback on the report prior to making any decisions.
If the CEO and RSO decide to proceed with deployment, they will share their decision–as well as the underlying Capability Report, internal feedback, and any external feedback–with the Board of Directors and the Long-Term Benefit Trust before moving forward.
The same is true for the “Safeguards Decision” (i.e. making an affirmative case that ASL-3 Required Safeguards have been sufficiently implemented, given that there is a model that has passed the relevant capabilities thresholds).
This is not true for the “Interim Measures” described as an allowable stopgap if Anthropic finds itself in the situation of having a model that requires ASL-3 Safeguards but is unable to implement those safeguards. My current read is that this is intended to cover the case where the “Capability Decision” report made the case that a model did not cross into requiring ASL-3 Safeguards, was approved by the CEO & RSO, and then later turned out to be wrong. It does seem like this permits more or less indefinite deployment of a model that requires ASL-3 Safeguards by way of “interim measures” which need to provide “the same level of assurance as the relevant ASL-3 Standard”, with no provision for what to do if it turns out that implementing the actually-specified ASL-3 standard is intractable. This seems slightly worse than the old policy:
If it becomes apparent that the capabilities of a deployed model have been under-elicited and the model can, in fact, pass the evaluations, then we will halt further deployment to new customers and assess existing deployment cases for any serious risks which would constitute a safety emergency. Given the safety buffer, de-deployment should not be necessary in the majority of deployment cases. If we identify a safety emergency, we will work rapidly to implement the minimum additional safeguards needed to allow responsible continued service to existing customers. We will provide transparency and support to impacted customers throughout the process. An emergency of this type would merit a detailed post-mortem and a policy shift to avoid re-occurrence of this situation.
which has much the same immediate impact, but with at least a nod to a post-mortem and policy adjustment.
But, overall, the new policy doesn’t seem to be opening up a gigantic hole that allows Dario to press the “all clear” button on capability determinations; he only has the additional option to veto, after the responsible team has already decided the model doesn’t cross the threshold.
Thanks, I think you’re right on both points—that the old RSP also didn’t require pre-specified evals, and that the section about Capability Reports just describes the process for non-threshold-triggering eval results—so I’ve retracted those parts of my comment; my apologies for the error. I’m on vacation right now so was trying to read quickly, but I should have checked more closely before commenting.
That said, it does seem to me like the “if/then” relationships in this RSP have been substantially weakened. The previous RSP contained enough wiggle room that I didn’t interpret it as imposing real constraints on Anthropic’s actions; but it did at least seem to me to be aiming at well-specified “if’s,” i.e., ones which depended on the results of specific evaluations. Like, the previous RSP describes their response policy as: “If an evaluation threshold triggers, we will follow the following procedure” (emphasis mine), where the trigger for autonomous risk happens if “at least 50% of the tasks are passed.”
In other words, the “if’s” in the first RSP seemed more objective to me; the current RSP strikes me as a downgrade in that respect. Now, instead of an evaluation threshold, the “if” is determined by some opaque internal process at Anthropic that the document largely doesn’t describe. I think in practice this is what was happening before—i.e., that the policy basically reduced to Anthropic crudely eyeballing the risk—but it’s still disappointing to me to see this level of subjectivity more actively codified into policy.
My impression is also that this RSP is more Orwellian than the first one, and this is part of what I was trying to gesture at. Not just that their decision process has become more ambiguous and subjective, but that the whole thing seems designed to be glossed over, such that descriptions of risks won’t really load in readers’ minds. This RSP seems much sparser on specifics, and much heavier on doublespeak—e.g., they use the phrase “unable to make the required showing” to mean “might be terribly dangerous.” It also seems to me to describe many things too vaguely to easily argue against. For example, they claim they will “explain why the tests yielded such results,” but my understanding is that this is mostly not possible yet, i.e., that it’s an open scientific question, for most such tests, why their models produce the behavior they do. But without knowing what “tests” they mean, nor the sort of explanations they’re aiming for, it’s hard to argue with; I’m suspicious this is intentional.