Thanks for the detailed critique – I love that you actually read the document closely. A few responses on particular points:
The document doesn’t specify whether “deployment” includes internal deployment.
Unless otherwise stated, “deployment” to us means external deployment – because this is the way most AI researchers use the term. Deployment mitigations level 2 discusses the need for mitigations on internal deployments. ML R&D will require thinking about internal deployments (and so will many of the other CCLs).
Some people get unilateral access to weights until the top level. This is disappointing. It’s been almost a year since Anthropic said it was implementing two-party control, where nobody can unilaterally access the weights.
I don’t think Anthropic meant to claim that two-party control would achieve this property. I expect that anyone using a cloud compute provider is trusting the provider not to access the model, rather than securing it against such unauthorized access. (In principle, some cryptographic schemes could allow you to secure model weights even from your cloud compute provider, but I highly doubt people are doing that, since it is very expensive.)
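To make the distinction concrete, here is a minimal illustrative sketch of what a two-party control check on weights access might look like. This is a hypothetical example, not Anthropic’s or Google’s actual implementation; the function and field names are invented for illustration. The point is that such a check blocks unilateral access by any single employee, but it runs inside the lab’s own authorization layer and does nothing about the party operating the underlying infrastructure.

```python
# Hypothetical sketch of a two-party control gate for model weights access.
# It denies any request that has not been approved by two distinct people
# other than the requester. Note that whoever operates the storage/compute
# infrastructure never passes through this check at all.

from dataclasses import dataclass

@dataclass(frozen=True)
class Approval:
    approver: str    # identity of the person approving
    request_id: str  # the specific access request being approved

def authorize_weights_access(request_id: str,
                             requester: str,
                             approvals: list[Approval]) -> bool:
    """Return True only if two distinct people, neither of them the
    requester, have approved this specific request."""
    distinct_approvers = {
        a.approver
        for a in approvals
        if a.request_id == request_id and a.approver != requester
    }
    return len(distinct_approvers) >= 2

# A requester with zero or one independent approvals is denied.
approvals = [Approval("alice", "req-42")]
assert not authorize_weights_access("req-42", "bob", approvals)

# With two independent approvers, access is granted.
approvals.append(Approval("carol", "req-42"))
assert authorize_weights_access("req-42", "bob", approvals)
```

That is why two-party control and securing weights against the cloud provider are different problems: a provider with raw access to the machines holding the weights never hits this code path, so protecting against it requires either trusting the provider or using much more expensive cryptographic or confidential-computing approaches.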
Mostly they discuss developers’ access to the weights. This is disappointing. It’s important but lots of other stuff is important too.
The emphasis on weights access isn’t meant to imply that other kinds of mitigations don’t matter. We focused on what it would take to increase our protection against exfiltration. A lot of the example measures in the RAND interim report aren’t discussed because we already do them. For example, Google already does the following from RAND Level 3: (a) develop an insider threat program and (b) deploy advanced red-teaming. (That’s not meant to be exhaustive; I don’t personally know the details here.)
No mention of evals during deployment (to account for improvements in scaffolding, prompting, etc.).
Sorry, that’s just poor wording on our part—“every 3 months of fine-tuning progress” was meant to capture that as well. Thanks for pointing this out!
Talking about plans like this is helpful. But with no commitments, DeepMind shouldn’t get much credit.
With the FSF, we prefer to try it out for a while and iron out any issues, particularly since the science is at an early stage and best practices will need to evolve as we learn more. But as you say, we are running evals even without official FSF commitments; e.g. the Gemini 1.5 tech report has dangerous capability evaluation results (see Section 9.5.2).
Given recent updates in AGI safety overall, I’m happy that GDM and Google leadership take commitments seriously and think carefully about which ones they are and are not willing to make, including the FSF, the White House Commitments, etc.
Thanks.
Deployment mitigations level 2 discusses the need for mitigations on internal deployments.
Good point; this makes it clearer that “deployment” means external deployment by default. But level 2 only mentions “internal access of the critical capability,” which sounds like it’s about misuse — I’m more worried about AI scheming and escaping when the lab uses AIs internally to do AI development.
ML R&D will require thinking about internal deployments (and so will many of the other CCLs).
OK. I hope DeepMind does that thinking and makes appropriate commitments.
two-party control
Thanks. I’m pretty ignorant on this topic.
“every 3 months of fine-tuning progress” was meant to capture [during deployment] as well
Yayyy!
Good point; this makes it clearer that “deployment” means external deployment by default. But level 2 only mentions “internal access of the critical capability,” which sounds like it’s about misuse — I’m more worried about AI scheming and escaping when the lab uses AIs internally to do AI development.
You’re right: our deployment mitigations are targeted only at misuse, because our current framework focuses on misuse. As we note in the “Future work” section, we would need to do more work to address risks from misaligned AI. We focused on risks from deliberate misuse initially because they seemed to us more likely to appear first.