I mostly agree. Anthropic is kind of an exception (see footnotes 3, 4, 5, 11, 12, and 14). Anthropic’s stance on model escape is basically “we’ll handle that in ASL-4.” This is reasonable, and better than not noticing the threat at all, but they’ve said ~nothing specific about escape/control yet.
I worry that quotes like
“red-teaming should confirm that models can’t cause harm quickly enough to evade detection”
are misleading in the context of escape (from all labs, not just Anthropic). When defining ASL-4, Anthropic will need to clarify whether this includes scheming/escape and how safety could be “confirm[ed]” — I think safety properties will need high-effort control evals, not just red-teaming.