I’m glad you asked this. I think there are many good suggestions by others. A few more:
1: Have explicit, written plans for various scenarios. When it makes sense, have drills for them. Make your plans public or, at least, solicit external feedback on them.
Examples of such scenarios:
Anthropic needs to pause due to RSP commitments
A model is caught executing a full-blown escape attempt
Model weights are stolen
A competing AI company makes credible claims about having AIs that would give it a decisive competitive advantage
2: Have a written list of assumptions you aim to maintain over each model’s lifecycle. Make your lists public or, at least, solicit external feedback on them. Iterate on them regularly. Communicate updates and violations at least internally.
These lists could vary based on ASL levels etc., and could include things like:
During the training of the model, we will continually evaluate the model on our evaluation suite for sabotage, sandbagging and scheming at [frequency].
We will solicit third-party evaluations of the model before internal deployment.
Throughout the model’s external deployment, we will have such-and-such monitoring schemes in place.
They could also include conditional statements (e.g. “if the model is [surprisingly capable] on [metric], we will do further investigation / take counter-measures ABC / …”); see the sketch below for what a machine-checkable version of such a list could look like. Cf. safety cases. I intend this to be less binding and formal than Anthropic’s RSP.
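To make the conditional-statement idea concrete, here is a minimal sketch of how a lifecycle-assumptions list could be kept in a machine-checkable form. This is purely illustrative: the `LifecycleAssumption` class, the metric names, thresholds, and responses are all hypothetical, not a description of Anthropic’s actual evaluation suite or RSP.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class LifecycleAssumption:
    """One assumption we aim to maintain over a model's lifecycle."""
    description: str                           # human-readable statement of the assumption
    check: Callable[[Dict[str, float]], bool]  # returns True while the assumption holds
    on_violation: str                          # pre-agreed response if the check fails

# Hypothetical example entries; metric names and thresholds are made up.
assumptions = [
    LifecycleAssumption(
        description="Sabotage/sandbagging/scheming evals run at least every 5 checkpoints.",
        check=lambda m: m.get("checkpoints_since_last_eval", 0.0) <= 5,
        on_violation="Pause training until the evaluation suite has been run.",
    ),
    LifecycleAssumption(
        description="The model is not surprisingly capable on the autonomy metric.",
        check=lambda m: m.get("autonomy_eval_score", 0.0) < 0.8,
        on_violation="Do further investigation and notify the safety team.",
    ),
]

def review(metrics: Dict[str, float]) -> None:
    """Report which assumptions currently hold and which call for escalation."""
    for a in assumptions:
        status = "OK" if a.check(metrics) else f"VIOLATED -> {a.on_violation}"
        print(f"{a.description}: {status}")

review({"checkpoints_since_last_eval": 7, "autonomy_eval_score": 0.6})
```

The point is not the code itself, but that writing the list in this form forces frequencies, thresholds, and responses to be pinned down in advance rather than decided ad hoc when a violation occurs.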
3: Keep external actors up to speed. At present, I expect that in many cases there are months of delay between when the first employees discover something and when it becomes publicly known (this applies to research, but also to more informal observations about model capabilities and properties). Months of delay are relatively long during a fast acceleration of AI R&D, and they shrink the number of actors who can effectively contribute.
This effect strengthens over time, so practicing and planning ahead seems prudent. Some ideas in that direction:
Provide regular updates about internal events and changes (via blog posts, streamed panel conversations, open Q&A sessions or similar)
Interviews, incident reporting and hotlines with external parties (as recommended here: https://arxiv.org/pdf/2407.17347)
Plan ahead for how to aggregate and communicate large amounts of output (once AI R&D has been considerably accelerated)
4: Invest in technical governance. As I understand it, there are various unsolved problems in technical governance (e.g. hardware-based verification methods for training runs), and progress on them would make international coordination easier. This seems like a particularly valuable R&D area to automate, and frontier AI companies like Anthropic are uniquely positioned to advance it. Consider working with technical governance experts on how to go about this.
Thank you for this post. I agree this is important, and I’d like to see improved plans.
Three comments on such plans.
1: Technical research and work.
(I broadly agree that the technical directions listed deserve priority.)
I’d want these plans to explicitly consider the effects of AI R&D acceleration, as those are significant. The speedups vary based on how constrained projects are on labor vs. compute; those that are mostly bottlenecked on labor could be massively sped up. (For instance, evaluations seem primarily labor-constrained to me.)
Lower labor costs have other implications as well, likely including for security (see also here) and for technical governance (making better verification methods technically feasible).
2: The high-level strategy
If I were to now write a plan for two-to-three-year timelines, the high-level strategy I’d choose is:
Don’t build generally vastly superhuman AIs. Use whatever technical methods we have now to control and align AIs which are less capable than that. Drastically speed up (technical) governance work with the AIs we have.[1] Push for governments and companies to enforce the no-vastly-superhuman-AIs rule.
Others might have different strategies; I’d like these plans to discuss what the high-level strategy or aims are.
3: Organizational competence
Reasoning transparency and a safety-first culture are mentioned in the post (in Layer 2), but I’d further prioritize and plan for organizational aspects, even when aiming for “the bare minimum”. Besides the general importance of organizational competence, there are two specific reasons for this:
If and when AI R&D acceleration is very fast, delays in information propagating to outsiders become more costly. That is: insofar as you want external actors to stay “in the loop” and be able to contribute, you need to put more effort into communicating what is happening internally.
Organizational competence and technical work are not fully at odds, as there are employees specialized in different things anyway.
(I think the responses to Evan Hubinger’s request for takes on what Anthropic should do differently have useful ideas for planning here.)
Note: I’m not technically knowledgeable about the field.