My reading of Anthropic’s ARA threshold, which nobody has yet contradicted:
The RSP defines/operationalizes 50% of ARA tasks (10% of the time) as a sufficient condition for ASL-3. (Sidenote: I think the literal reading is that this is an ASL-3 definition, but I assume it’s supposed to be an operationalization of the safety buffer, 6x below ASL-3.[1])
The May RSP evals report suggests 50% of ARA tasks (10% of the time) is merely a yellow line. (Pages 2 and 6, plus page 4 says “ARA Yellow Lines are clearly defined in the RSP” but the RSP’s ARA threshold was not just a yellow line.)
This is not kosher; Anthropic needs to formally amend the RSP to [raise the threshold / make the old threshold no longer binding].
(It’s totally plausible that the old threshold was too low, but the solution is to raise it officially, not pretend that it’s just a yellow line.)
(The forthcoming RSP update will make old thresholds moot; I’m just concerned that Anthropic ignored the RSP in the past.)
(Anthropic didn’t cross the threshold and so didn’t violate the RSP — it just ignored the RSP by implying that if it crossed the threshold it wouldn’t necessarily respond as required by the RSP.)
Correction, October 15: this is wrong; the policy was ambiguous but it didn’t really commit to the thing I thought it did. Mea culpa.
Footnote to table cells D18 and D19:
My reading of Anthropic’s ARA threshold, which nobody has yet contradicted:
The RSP defines/operationalizes 50% of ARA tasks (10% of the time) as a sufficient condition for ASL-3.
(Sidenote: I think the literal reading is that this is an ASL-3 definition, but I assume it’s supposed to be an operationalization of the safety buffer, 6x below ASL-3.[1])
The May RSP evals report suggests 50% of ARA tasks (10% of the time) is merely a yellow line. (Pages 2 and 6, plus page 4 says “ARA Yellow Lines are clearly defined in the RSP” but the RSP’s ARA threshold was not just a yellow line.)
This is not kosher; Anthropic needs to formally amend the RSP to [raise the threshold / make the old threshold no longer binding].
(It’s totally plausible that the old threshold was too low, but the solution is to raise it officially, not pretend that it’s just a yellow line.)
(The forthcoming RSP update will make old thresholds moot; I’m just concerned that Anthropic ignored the RSP in the past.)
(Anthropic didn’t cross the threshold and so didn’t violate the RSP — it just ignored the RSP by implying that if it crossed the threshold it wouldn’t necessarily respond as required by the RSP.)
Update: another part of the RSP says this threshold implements the safety buffer.