This seems mostly right, but it’s worth noting that Anthropic’s current RSP contains “Early Thoughts on ASL-4” and this section includes language like:
Model theft should be prohibitively costly for state-level actors, even with the help of a significant number of employees and the model itself.
Safety research: Sufficient progress on the science of interpretability, alignment training, and model evaluations to make an “affirmative case” that our models will not autonomously attempt to strategically undermine our safety measures or cause large-scale catastrophe.
red-teaming should confirm that models can’t cause harm quickly enough to evade detection.
In principle these measures could suffice for managing risk from scheming.
(In practice, I expect that making a strong “affirmative case” that models won’t attempt to undermine safety measures will be very hard, and thus likely something else (more like control) will happen instead.)
I mostly agree. Anthropic is kind of an exception (see footnotes 3, 4, 5, 11, 12, and 14). Anthropic’s stance on model escape is essentially “we’ll handle that in ASL-4”; this is reasonable and better than not noticing the threat, but they’ve said ~nothing specific about escape/control yet.
I worry that quotes like
red-teaming should confirm that models can’t cause harm quickly enough to evade detection.
are misleading in the context of escape (from all labs, not just Anthropic). When defining ASL-4, Anthropic will need to clarify whether this includes scheming/escape and how safety could be “confirm[ed]” — I think safety properties will need high-effort control evals, not just red-teaming.