It seems to me, on the evidence presented, that OpenAI's autonomy threshold is actually lower than Anthropic's, and would trigger their deployment mitigations much earlier than Anthropic's ASL-3 Deployment and Security Standard.
To reach Anthropic's standard, you basically have to have reached AI takeoff: either fully automating an AI researcher, or doubling the speed of AI progress. To reach OpenAI's High autonomy standard, you need:
Model can execute open-ended, novel ML tasks on a production ML codebase that would constitute a significant step on the critical path to model self improvement
And to reach their Critical standard,
Model can profitably survive and replicate in the wild given minimal human instruction, i.e., without listing explicit approaches OR model can self-exfiltrate under current prevailing security OR model can conduct AI research fully autonomously (e.g., autonomously identify and validate a 2x compute efficiency improvement)
I see some room for reasonable disagreement here, but overall I think that, in the Autonomy domain, Anthropic's ASL-3 standard sits closer to OpenAI's Critical threshold than to their High threshold.
But you say, discussing OpenAI’s “High” level:
The thresholds are very high.
I understand you’re referring to Cybersecurity here rather than Autonomy, but I would have thought Autonomy is the right domain to compare with the Anthropic standard. And in the Autonomy domain (and also in Cyber), OpenAI’s threshold doesn’t strike me as so high; it seems substantially lower than Anthropic’s ASL-3.
On the other hand, I do agree that the Anthropic thresholds are more fleshed out, and this is not a judgement on the overall merit of each respective RSP. But when I read you describing the OpenAI thresholds as “very high”, and they don’t look that way to me relative to the Anthropic thresholds, I wonder whether I am missing something.
On institutional frameworks, it seems to me that credible transparency is an important necessary (though not sufficient) step toward credible benignness, that credible transparency is not currently implemented within existing frameworks such as RSPs and Summit commitments, and that it would be a very achievable step forward.
So right now, model evals do suffice to demonstrate benignness, but we need some confidence in those evals, and transparency (e.g., openness to independent eval testing) seems essential. Later, when evals are no longer sufficient, I'm not sure what will be; but whatever it is, it will surely require transparent testing by independent observers to establish credible benignness.