Zach Stein-Perlman comments on Model evals for dangerous capabilities

Zach Stein-Perlman 23 Sep 2024 11:00 UTC
15 points
Correction, October 15: this is wrong; the policy was ambiguous but it didn’t really commit to the thing I thought it did. Mea culpa.
Footnote to table cells D18 and D19:
My reading of Anthropic’s ARA threshold, which nobody has yet contradicted:
1. The RSP defines/operationalizes 50% of ARA tasks (10% of the time) as a sufficient condition for ASL-3. ~~(Sidenote: I think the literal reading is that this is an ASL-3 definition, but I assume it’s supposed to be an operationalization of the safety buffer, 6x below ASL-3.)~~^[1]
2. The May RSP evals report suggests 50% of ARA tasks (10% of the time) is merely a yellow line. (Pages 2 and 6, plus page 4 says “ARA Yellow Lines are clearly defined in the RSP” but the RSP’s ARA threshold was not just a yellow line.)
3. This is not kosher; Anthropic needs to formally amend the RSP to [raise the threshold / make the old threshold no longer binding].
(It’s totally plausible that the old threshold was too low, but the solution is to raise it officially, not pretend that it’s just a yellow line.)
(The forthcoming RSP update will make old thresholds moot; I’m just concerned that Anthropic ignored the RSP in the past.)
(Anthropic didn’t cross the threshold and so didn’t violate the RSP — it just ignored the RSP by implying that if it crossed the threshold it wouldn’t necessarily respond as required by the RSP.)
1. Update: another part of the RSP says this threshold implements the safety buffer.