In which worlds would AI Control (or any other agenda which relies on non-trivial post-training operation) prevent significant harm?
When I bring up the issue of AI model security to people working in AI safety, I’m often met with something of the form “yes, this is a problem. It’s important that people work hard on securing AI models. But it doesn’t really affect my work”.
Using AI Control (an area which has recently excited many in the field) as an example, I lay out an argument for why, once the realities of our cyber security situation are taken into account, it might not be as effective an agenda as one might think.
1. AI Control concerns itself with models that intentionally try to subvert their developers.
2. These models are likely to be very generally capable and, without countermeasures, capable of causing significant harm.
3. Leading cyber-capable institutions would likely expend significant resources and political capital to steal these models, or to steal enough insights to reproduce them.
4. If the weights or insights are stolen, work on AI Control will not prevent these models from causing significant harm.
5. Current AI developers are not on track to be able to defend against high-priority operations from leading cyber-capable institutions in the coming years.
6. Therefore, AI Control will only be useful in the coming years under one (or more) of these conditions:
    a. Models that scheme are unlikely to be generally capable/dangerous enough to be a high-priority target for leading cyber-capable institutions.
    b. Models that scheme are only developed by actors that can thwart high-priority operations from leading cyber-capable institutions (which precludes current AI developers for at least several years).
    c. AI Control won’t be directly useful in the coming years, but it will be indirectly useful for progressing the field for when models are developed by actors capable of thwarting top cyber operations.
    d. Even if the model were stolen and caused significant harm, there would still be less harm overall than if the model undergoing AI Control measures also broke out. Of course, this argument is much weaker for models capable of existential harm.
    e. Actors that steal the model would also implement strong-enough AI Control measures and would not misuse the AI to cause significant harm.
There are of course other arguments against working on AI control. E.g. it may encourage the development and use of models that are capable of causing significant harm. This is an issue if the AI control methods fail or if the model is stolen. So one must be willing to eat this cost or argue that it’s not a large cost when advocating for AI Control work.
This isn’t to say that AI Control isn’t a promising agenda; I just think people need to carefully consider the cases in which their agenda falls down for reasons that aren’t technical arguments about the agenda itself.
I’m also interested to hear takes from those excited by AI Control on which of the conditions listed in #6 above they expect to hold (or to otherwise poke holes in the argument).
EDIT (thanks Zach and Ryan for bringing this up): I didn’t want to imply that AI Control is unique here, this argument can be levelled at any agenda which relies on something like a raw model + non-trivial operation effort. E.g. a scheme which relies on interpretability or black box methods for monitoring or scalable oversight.
This argument isn’t control specific, it applies to any safety scheme with some operating tax or implementation difficulty.[1][2]
I agree that if there is insufficient security, paying any sizable safety tax (including slowing down) will be difficult. It would either require substantial coordination or much more aggressive USG action than would otherwise be needed.
Thus, I agree that security is a huge factor in how AI goes, plausibly more important than technical safety work done in advance, and I agree that the technical AI safety community probably underestimates the importance of security. (Security is probably notably less tractable than technical AI safety measures for a variety of reasons, but perhaps of greater importance than doing technical AI safety in advance.)
That said, I can certainly imagine worlds where AI control saves us despite the model being stolen by multiple adversaries. My sense is that the situation is much worse if a powerful AI is stolen, especially if it is stolen by multiple actors, but it isn’t obviously fatal.
>Therefore, AI control will only be useful in the coming years under one (or more) of these conditions
I would note that we don’t just aim to prevent the outcome of the model escaping. Indeed, even if a (misaligned) model was literally made publicly available (on a torrent or whatever), we might still care about safely extracting work out of instances of the model we control. We no longer care about preventing escape, but we do care about getting good non-sabotaged work (on e.g. security or technical AI safety).
We also aren’t just worried about direct sabotage, we want to prevent the AI from setting itself up in a position where it has much more power over the AI company. E.g., we really want to prevent it from being able to run itself without monitoring in a position where it has full permissions to do things within the AI company.
Even if the model has literally already escaped into the outside world, I think control is useful. (Though this is obviously a dismal situation to be in.)
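To make the kind of operational effort being discussed here concrete, below is a minimal sketch of a trusted-monitoring protocol of the sort AI Control work studies; the function names, dataclass, and threshold are hypothetical stand-ins, not anyone’s actual implementation.

```python
# Minimal sketch of a trusted-monitoring control protocol (illustrative only; all
# names and the threshold are hypothetical stand-ins).
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedAction:
    description: str  # e.g. a shell command or code diff the untrusted model wants to run
    context: str      # the task the model claims it is working on

def run_with_monitoring(
    untrusted_model: Callable[[str], ProposedAction],
    trusted_monitor: Callable[[ProposedAction], float],   # suspicion score in [0, 1]
    execute: Callable[[ProposedAction], str],
    escalate_to_human: Callable[[ProposedAction], bool],  # True iff a human auditor approves
    task: str,
    suspicion_threshold: float = 0.8,
) -> str:
    """Ask the untrusted model for an action, but only run it if a weaker trusted
    monitor doesn't flag it; flagged actions go to a human audit before anything
    executes with real permissions."""
    action = untrusted_model(task)
    if trusted_monitor(action) >= suspicion_threshold:
        if not escalate_to_human(action):
            return "[action blocked by audit]"
    return execute(action)
```

The safety here lives in the surrounding protocol (monitoring, auditing, restricted permissions) rather than in the weights themselves, which is exactly the operational effort that is lost if only the raw model is exfiltrated.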
As an example scenario:
- China steals the model and is using it heavily.
- We still want to be able to use our model without it causing huge problems, though escaping to the outside world isn’t as big of a problem (it still might be better to reduce further proliferation).
- We coordinate with China to slow down and to take sufficient precautions that the model doesn’t escape from either of us. This allows for paying the control tax at least for a while. We don’t necessarily need China to do a good job with control other than avoiding their model escaping. Just delaying the point at which it escapes is also pretty helpful.
- During the time we have, we use huge amounts of AI labor to figure out a cheaper and more robust alignment strategy, or we build the case for a stronger level of coordination.
Edit: One further note is that if the tax from control is sufficiently small and takeoff is sufficiently slow by default, then control can still let us get out the key work done on time (prior to adversaries being able to train much more powerful models such that we can’t be competitive anymore).
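As a toy way to see this tradeoff (my numbers and framing, not part of the original comment): if paying the control tax slows your progress by a fraction `control_tax` relative to an adversary, a head start worth `lead_months` of progress lasts roughly `lead_months / control_tax` months, so a small tax plus a slow default takeoff can leave enough time to get the key work done.

```python
def months_until_overtaken(lead_months: float, control_tax: float) -> float:
    """Toy model: we start `lead_months` of progress ahead; paying the control tax
    means we make (1 - control_tax) months of progress per calendar month while the
    adversary makes 1, so the gap closes at `control_tax` months of progress per month."""
    if control_tax <= 0:
        return float("inf")  # no tax: in this toy model the lead never shrinks
    return lead_months / control_tax

# A 6-month lead with a 10% control tax lasts ~60 months; with a 50% tax, only ~12.
print(months_until_overtaken(6, 0.10))  # 60.0
print(months_until_overtaken(6, 0.50))  # 12.0
```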
You also bring up a risk compensation argument: maybe control is bad because it encourages the development of powerful models via making some of the risks lower. This argument also applies to other safety strategies.
Zach also noted this, but I thought it was worth emphasizing.
Thanks for the thoughts. They’ve made me think that I’m likely underestimating how much Control is needed to get useful work out of AIs capable and inclined to scheme. Ideally, this fact would increase the likelihood of other actors implementing AI Control schemes with the stolen model that are at least sufficient for containment and/or make them less likely to steal the model, though I wouldn’t want to put too much weight on this hope.
>This argument isn’t control specific, it applies to any safety scheme with some operating tax or implementation difficulty.[1][2]
Yep, for sure. I’ve changed the title and commented about this at the end.
>Ideally, this fact would increase the likelihood of other actors implementing AI Control schemes with the stolen model that are at least sufficient for containment and/or make them less likely to steal the model, though I wouldn’t want to put too much weight on this hope.
Agreed, it might not be clear to all actors that AI control would be in their interests for reducing risks of sabotage and other bad outcomes!
I think one of the priorities of safety-conscious AI developers should be to develop some realistic and downright terrifying demos to show privately to NatSec folks in US/UK/EU. Gotta get that military-grade security in place before we get too much further.
I’d say a combo of 6B and 6C is my story for AI control working: AI labor is used to make computer security much better, and importantly security becomes good enough that even nation-states can’t defeat it. This happens through a combination of general security practices improving and AIs becoming good enough at mathematics and coding that they can generate sound formal proofs without humans in the loop, which is then used to defend the most critical infrastructure of AI labs.
This post is probably the best story on how we could get to a state such that AI control is useful:
https://www.lesswrong.com/posts/2wxufQWK8rXcDGbyL/access-to-powerful-ai-might-make-computer-security-radically
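As a toy illustration of what “sound formal proofs without humans in the loop” could mean in practice (my example, not the commenter’s): an AI writing infrastructure code could be required to ship small machine-checkable lemmas alongside it, e.g. that a bounds check really does rule out out-of-range access, with only proof-carrying code deployed to critical systems.

```lean
-- Toy example of a machine-checkable guarantee (hypothetical, purely illustrative):
-- if an index has passed the check `i < n`, it stays within a buffer of size n.
example (i n : Nat) (h : i < n) : i ≤ n - 1 := by omega
```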
I agree safety-by-control kinda requires good security. But safety-by-alignment kinda requires good security too.
I think that even after your edit, your argument still applies more broadly than you’re giving it credit for: if computer security is going to go poorly, then we’re facing pretty serious AI risk even if the safety techniques require trivial effort during deployment.
If your AI is stolen, you face substantial risk even if you had been able to align it (e.g. because you might immediately get into an AI enabled war, and you might be forced to proceed with building more powerful and less-likely-to-be-aligned models because of the competitive pressure).
So I think your argument also pushes against working on alignment techniques.
I’m curious @Dan Braun, why don’t you work on computer security (assuming I correctly understand that you don’t)?
I plan to spend more time thinking about AI model security. The main reasons I’m not spending a lot of time on it now are:
- I’m excited about the project/agenda we’ve started working on in interpretability, and my team/org more generally, and I think (or at least I hope) that I have a non-trivial positive influence on it.
- I haven’t thought through what the best things to do would be. Some ideas (takes welcome):
    - Help create RAND or RAND-style reports like Securing AI Model Weights (I think this report is really great). E.g.:
        - Make forecasts about how much interest from adversaries certain models are likely to get, and then how likely the model is to be stolen/compromised given that level of interest and the developer’s level of defense (a toy sketch of what this could look like follows this list). I expect this to be much more speculative than a typical RAND report. It might also require a bunch of non-public info on both offense and defense capabilities.
        - (not my idea) Make forecasts about how long a lab would take to implement certain levels of security.
    - Make demos that convince natsec people that AI is or will be very capable and become a top-priority target.
    - Improve security at a lab (probably requires becoming a full-time employee).
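To make the forecasting idea above concrete, here is a toy sketch (my own; all numbers are placeholders, not estimates) built loosely on the framing of RAND’s Securing AI Model Weights report, which rates attacker operational capacity (OC1–OC5) against lab security levels (SL1–SL5):

```python
# Toy theft-risk forecast (illustrative placeholders, not real estimates).
# Rows: attacker operational capacity OC1 (amateur) .. OC5 (top state programs).
# Columns: defender security level SL1 (weakest) .. SL5 (strongest).
P_SUCCESS_GIVEN_ATTEMPT = {
    1: (0.30, 0.10, 0.02, 0.01, 0.005),
    2: (0.60, 0.30, 0.10, 0.02, 0.01),
    3: (0.90, 0.60, 0.30, 0.10, 0.02),
    4: (0.95, 0.90, 0.60, 0.30, 0.10),
    5: (0.99, 0.95, 0.90, 0.60, 0.30),
}

def p_model_stolen(attacker_oc: int, defender_sl: int, p_attempt: float) -> float:
    """P(stolen) = P(adversary mounts a high-priority operation) * P(success | attempt)."""
    return p_attempt * P_SUCCESS_GIVEN_ATTEMPT[attacker_oc][defender_sl - 1]

# e.g. a model interesting enough that a top state program (OC5) attempts theft with
# probability 0.8, against a lab at roughly SL3:
print(p_model_stolen(attacker_oc=5, defender_sl=3, p_attempt=0.8))  # ≈ 0.72
```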
In general, the hacking capabilities of state actors and the likely involvement of national security agencies as we get closer to AGI feel like significant blind spots in LessWrong discourse.
(The Hacker and The State by Ben Buchanan is a great book to learn about the former)
But states seem quite likely to fall under 6e, no?
Quite the opposite, it seems to me, but what you consider “misuse” and “harm” depends on what you value, I suppose.
Maybe we should pick the States we want to be in control, before the States pick for us....
Do you have a preference between Switzerland and North Korea? Cause I sure do.
If we passively let whoever steals the most tech come out as the winner, we might end up with some unpleasant selection effects.
https://www.lesswrong.com/posts/uPi2YppTEnzKG3nXD/nathan-helm-burger-s-shortform?commentId=TwDt9HSh4L3NXFFAF
jbash seems to think we LessWrong posters have no power to affect global-scale outcomes. I disagree. I believe some of us have quite a bit of power, and should use it before we lose it.
I think it’s important to consider hacking in any safety efforts. These hacks would probably include stealing and using any safety methods for control or alignment, for the same reasons the originating org was using them—they don’t want to lose control of their AGI. Better make those techniques and their code public, and publicly advertise why you’re using them!
Of course, we’d worry that some actors (North Korea, Russia, individuals who are skilled hackers) are highly misaligned with the remainder of humanity, and might bring about existential catastrophes through some combination of foolishness and selfishness.
The other concern is mere proliferation of aligned/controlled systems, which leads to existential danger as soon as those systems approach the capability for autonomous recursive self-improvement: If we solve alignment, do we die anyway?
This might be a reason to try to design AIs to fail safe and break without their controlling units. E.g. before fine-tuning language models to be useful, fine-tune them to not generate useful content without approval tokens generated by a supervisory model.
I don’t see how that would work technically. It seems like any small set of activating tokens would be stolen along with the weights, and I don’t see how to train it for a large shifting set.
I’m not saying this is impossible, just that I’m not sure it is. Can you flesh this idea out any further?
Sorry, that was an off-the-cuff example I meant to help gesture towards the main idea. I didn’t mean to imply it’s a working instance (it’s not). The idea I’m going for is:
- I’m expecting future AIs to be less like single LLMs (like Llama) and more like loops, search, and scaffolding (like o1).
- Those AIs will be composed of individual pieces.
- Maybe we can try making the AI’s pieces mutually dependent, in such a way that it’s a pain to get the AI working at peak performance unless you include the safety pieces.
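To gesture slightly more concretely at the approval-token version of this idea (my sketch, hypothetical, and explicitly not a working defense):

```python
# Hypothetical sketch of the "approval token" gating idea (illustrative only).
# Assumes a worker model that has been fine-tuned so that useful completions require
# an approval prefix which only a separate supervisory model knows how to issue.
from typing import Callable, Optional

def gated_generate(
    worker: Callable[[str], str],                # capable model, trained to need approval tokens
    supervisor: Callable[[str], Optional[str]],  # returns an approval token, or None to refuse
    task: str,
) -> str:
    token = supervisor(task)
    if token is None:
        return "[request refused by supervisor]"
    # Without a valid token in its prompt the worker (by training, not just by scaffolding)
    # would produce degraded output, so the pieces are only useful together.
    return worker(f"<approval:{token}>\n{task}")
```

Whether the dependence can actually be pushed into the weights, rather than sitting in a scaffold that gets stolen and stripped along with them, is exactly the open question raised above.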