I’m glad you asked this. I think there are many good suggestions by others. A few more:
1: Have explicit, written plans for various scenarios. When it makes sense, have drills for them. Make your plans public or, at least, solicit external feedback on them.
Examples of such scenarios:
Anthropic needs to pause due to RSP commitments
A model is caught executing a full-blown escape attempt
Model weights are stolen
A competing AI company makes credible claims about having AIs that imply decisive competitive advantage
2: Have a written list of assumptions you aim to maintain for each model’s lifecycle. Make your lists public or, at least, solicit external feedback on them. Iterate on them regularly, and communicate updates and violations at least internally.
These lists could vary based on ASL level etc., and could include things like:
During the training of the model, we will continually evaluate the model on our evaluation suite for sabotage, sandbagging and scheming at [frequency].
We solicit third-party evaluations on the model before internal deployment.
Throughout the model’s external deployment, we will have such-and-such monitoring schemes in place.
These lists could also include conditional statements (e.g. “if the model is [surprisingly capable] on [metric], we will do further investigation / take counter-measures ABC / …”). Cf. safety cases. I intend this to be less binding and formal than Anthropic’s RSP.
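To make the idea in point 2 slightly more concrete, here is a minimal, hypothetical sketch (in Python) of what such a lifecycle assumptions list with conditional triggers could look like in a structured, machine-checkable form. All names, metrics and thresholds below are made-up placeholders for illustration, not a claim about how Anthropic does or should implement this.

```python
# Hypothetical sketch: a per-model list of lifecycle assumptions plus
# "if X then Y" conditional triggers, in the spirit of safety cases.
# Every identifier and threshold here is illustrative only.
from dataclasses import dataclass, field


@dataclass
class Assumption:
    """One assumption the lab aims to maintain during a model's lifecycle."""
    description: str
    phase: str              # e.g. "training", "internal_deployment", "external_deployment"
    check_frequency: str    # e.g. "every N training steps", "weekly"


@dataclass
class ConditionalTrigger:
    """An 'if the model scores above X on metric M, do Y' clause."""
    metric: str             # hypothetical evaluation metric name
    threshold: float        # score above which the response fires
    response: str           # investigation or counter-measure to carry out


@dataclass
class LifecycleAssumptions:
    model_name: str
    asl_level: int          # the list can vary by ASL level
    assumptions: list[Assumption] = field(default_factory=list)
    triggers: list[ConditionalTrigger] = field(default_factory=list)

    def fired_triggers(self, eval_results: dict[str, float]) -> list[ConditionalTrigger]:
        """Return every trigger whose metric exceeds its threshold."""
        return [
            t for t in self.triggers
            if eval_results.get(t.metric, 0.0) > t.threshold
        ]


# Example usage with made-up values:
plan = LifecycleAssumptions(
    model_name="example-model",
    asl_level=3,
    assumptions=[
        Assumption(
            description="Run the sabotage/sandbagging/scheming eval suite",
            phase="training",
            check_frequency="every N training steps",
        ),
        Assumption(
            description="Solicit third-party evaluations before internal deployment",
            phase="internal_deployment",
            check_frequency="once, pre-deployment",
        ),
    ],
    triggers=[
        ConditionalTrigger(
            metric="autonomy_eval_score",
            threshold=0.8,
            response="Pause and run a deeper capability investigation",
        ),
    ],
)

for trigger in plan.fired_triggers({"autonomy_eval_score": 0.85}):
    print(f"Trigger fired on {trigger.metric}: {trigger.response}")
```

Writing the list in a structured form like this would make it easier to check violations automatically and to communicate updates, though the same content could of course live in a plain document.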
3: Keep external actors up to speed. At present, I expect that in many cases there are months of delay between when the first employees discover something and when it becomes publicly known (this applies to research, but also to more informal observations about model capabilities and properties). But months of delay are relatively long during fast acceleration of AI R&D, and they shrink the pool of actors who can effectively contribute.
This effect strengthens over time, so practicing and planning ahead seems prudent. Some ideas in that direction:
Provide regular updates about internal events and changes (via blog posts, streamed panel conversations, open Q&A sessions or similar)
Set up interviews, incident reporting and hotlines with external parties (as recommended here: https://arxiv.org/pdf/2407.17347)
Plan ahead for how to aggregate and communicate large amounts of output (once AI R&D has been considerably accelerated)
4: Invest in technical governance. As I understand it, there are various unsolved problems in technical governance (e.g. hardware-based verification methods for training runs), and progress in those would make international coordination easier. This seems like a particularly valuable R&D area to automate, which is something frontier AI companies like Anthropic are uniquely fit to advance. Consider working with technical governance experts on how to go about this.