Thanks for the thoughts. They’ve made me think that I’m likely underestimating how much Control is needed to get useful work out of AIs that are capable of and inclined to scheme. Ideally, this fact would make it more likely that other actors who steal the model implement AI Control schemes at least sufficient for containment, and/or make them less likely to steal the model in the first place, though I wouldn’t want to put too much weight on this hope.
>This argument isn’t control specific, it applies to any safety scheme with some operating tax or implementation difficulty.[1][2]
Yep, for sure. I’ve changed the title and commented about this at the end.
I plan to spend more time thinking about AI model security. The main reasons I’m not spending a lot of time on it now are:
- I’m excited about the project/agenda we’ve started working on in interpretability, and about my team/org more generally, and I think (or at least hope) that I have a non-trivial positive influence on it.
- I haven’t thought through what the best things to do would be. Some ideas (takes welcome):
  - Help create RAND or RAND-style reports like Securing AI Model Weights (I think this report is really great). E.g.:
    - Make forecasts about how much interest from adversaries certain models are likely to attract, and then how likely the model is to be stolen/compromised given that level of interest and the developer’s level of defense (see the toy sketch after this list). I expect this to be much more speculative than a typical RAND report. It might also require a bunch of non-public info on both offense and defense capabilities.
    - (Not my idea) Make forecasts about how long a lab would take to implement certain levels of security.
  - Make demos that convince natsec people that AI is or soon will be very capable, and will become a top-priority target.
  - Improve security at a lab (probably requires becoming a full-time employee).
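
To make the forecasting idea above a bit more concrete, here is a toy sketch of one way such a forecast could be decomposed: attacker classes, the probability each seriously attempts theft, and the probability of success given the developer’s security level. The class names loosely borrow the attacker-capability (OC) and security-level (SL) framing from the RAND report, but every name and number below is a made-up placeholder, not an estimate.

```python
# Toy decomposition of P(model weights stolen). All probabilities are illustrative
# placeholders; the OC/SL labels loosely echo the RAND report's framing.

# P(an attacker of this class seriously attempts theft of a given model).
p_attempt = {"OC3_cybercrime": 0.6, "OC4_state_adjacent": 0.4, "OC5_top_state": 0.2}

# P(theft succeeds | serious attempt, developer's security level).
p_success_given_attempt = {
    "SL2": {"OC3_cybercrime": 0.5, "OC4_state_adjacent": 0.8, "OC5_top_state": 0.95},
    "SL4": {"OC3_cybercrime": 0.05, "OC4_state_adjacent": 0.3, "OC5_top_state": 0.7},
}

def p_model_stolen(defense_level: str) -> float:
    """P(at least one attacker class steals the weights), assuming independence across classes."""
    p_not_stolen = 1.0
    for attacker, p_att in p_attempt.items():
        p_theft = p_att * p_success_given_attempt[defense_level][attacker]
        p_not_stolen *= 1.0 - p_theft
    return 1.0 - p_not_stolen

for sl in ("SL2", "SL4"):
    print(f"{sl}: P(stolen) ~ {p_model_stolen(sl):.2f}")
```

Even this toy version makes the hard part obvious: the conditional success probabilities are exactly the kind of non-public offense/defense info I mentioned above.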