Thanks for sharing this! A few thoughts:

It is likely that at ASL-4 we will require a detailed and precise understanding of what is going on inside the model, in order to make an “affirmative case” that the model is safe.
I’d be extremely excited for Anthropic (or ARC or other labs) to say more about what they believe would qualify as an affirmative case for safety. I appreciate this sentence a lot, and I think a “strong version” of affirmative safety (one that goes beyond “we have not been able to detect danger” toward “we have an understanding of the system we are building and can make some formal or near-formal guarantees about its dangers”) would be excellent.
On the other hand, a “weak version” of affirmative safety (e.g., “look, we have shown you it’s safe because the red-teamers could not jailbreak it using existing techniques, so now we’re confident it’s safe & we’re going to deploy it widely & scale by another 10X”) would be much worse than the “strong version”.
So a lot of this will come down to how we interpret and enforce “affirmative safety”, and I’d be excited to see governance proposals that center around this.
Note that the recent FLI scorecard has a column related to affirmative safety (“Burden of proof on developer to demonstrate safety?”), and it currently states that Anthropic’s RSP does not put the burden of proof on developers. I think this is an accurate characterization of Anthropic’s current RSP. I hope that future RSPs (from Anthropic or other companies) score better on this dimension.
RSPs are not intended as a substitute for regulation, but rather a prototype for it
Glad that this was said explicitly. I think whether RSPs will be a good prototype or building block for regulation will depend a lot on how much RSPs end up adopting strong versions of “affirmative safety”.
If I could wave a magic wand and add something to the statement, I’d add something like this:
“In the event that companies cannot show affirmative safety, we may need to pause frontier AI development for a long period of time. Anthropic is open to the idea that AI development past a certain computing threshold should be prohibited, except in the context of a multinational organization dedicated to AGI safety. We encourage world leaders to pursue this option, and we would be eager to see progress made on the international agreements needed to make this idea into a reality.” (Not a real quote from Dario.)
Dario did not say this (or anything like it), and I think that’s my biggest criticism of the statement. The statement reads as “let’s let companies develop safety measures and race to the top,” but this still allows a race to AGI in the first place.
I appreciate Dario for including the bit about affirmative safety. As a next step, I’d like to see him (and other lab leaders) acknowledge that affirmative safety might be extremely difficult, and that, since it might be, they’re actively excited to see progress on international coordination that could end the race to godlike AI.
(And of course, such statements don’t commit Anthropic to stopping until/unless such international coordination is achieved.)