This argument isn’t control-specific; it applies to any safety scheme with some operating tax or implementation difficulty.[1][2]
I agree that if there is insufficient security, paying any sizable safety tax (including slowing down) will be difficult. It would either require substantial coordination or much more aggressive USG action than would otherwise be needed.
Thus, I agree that security is a huge factor in how AI goes, plausibly more important than technical safety work done in advance, and I agree that the technical AI safety community probably underestimates the importance of security. (Security is probably notably less tractable than technical AI safety measures for a variety of reasons, but perhaps of greater importance than doing technical AI safety in advance.)
That said, I can certainly imagine worlds where AI control saves us despite the model being stolen by multiple adversaries. My sense is that the situation is much worse if a powerful AI is stolen, especially if it is stolen by multiple actors, but it isn’t obviously fatal.
>Therefore, AI control will only be useful in the coming years under one (or more) of these conditions
I would note that preventing the model from escaping isn’t the only outcome we aim for. Indeed, even if a (misaligned) model were literally made publicly available (on a torrent or whatever), we might still care about safely extracting work out of the instances of the model we control. We would no longer care about preventing escape, but we would still care about getting good, non-sabotaged work (on, e.g., security or technical AI safety).
We also aren’t just worried about direct sabotage; we want to prevent the AI from setting itself up in a position where it has much more power over the AI company. E.g., we really want to prevent it from being able to run itself, unmonitored, with full permissions to do things within the AI company.
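To make that concrete, here is a minimal sketch of what I mean by gating the AI behind monitoring and narrow permissions. All names and checks here are hypothetical illustrations, not any particular company's setup:

```python
# Illustrative sketch only: names, checks, and permission sets are hypothetical.
# The point is that the untrusted model never executes actions directly; every
# proposed action passes through a monitor and a narrow allowlist of permissions.

from dataclasses import dataclass

# Narrow permission set: no deployment, no self-spawning, no weight exfiltration.
ALLOWED_ACTIONS = {"read_file", "run_tests", "propose_patch"}

@dataclass
class ProposedAction:
    kind: str
    payload: str

def monitor_flags(action: ProposedAction) -> bool:
    """Stand-in for a trusted monitor (e.g. a weaker trusted model or human
    review) that flags suspicious actions for audit instead of running them."""
    suspicious_markers = ("curl ", "ssh ", "export WEIGHTS", "chmod 777")
    return any(marker in action.payload for marker in suspicious_markers)

def execute_with_control(action: ProposedAction) -> str:
    # Permission check: the model only ever gets a narrow set of capabilities.
    if action.kind not in ALLOWED_ACTIONS:
        return "blocked: action type not permitted"
    # Monitoring check: suspicious actions get escalated, not executed.
    if monitor_flags(action):
        return "escalated: sent to human audit"
    return f"executed: {action.kind}"

# The model cannot grant itself unmonitored execution or broader permissions.
print(execute_with_control(ProposedAction("spawn_agent", "run me with full perms")))
print(execute_with_control(ProposedAction("run_tests", "pytest -q")))
```

The specific checks don't matter; the point is just that the model never gets a position from which it can run itself without oversight.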
Even if the model has literally already escaped into the outside world, I think control is useful. (Though this is obviously a dismal situation to be in.)
As an example scenario:
- China steals the model and is using it heavily.
- We still want to be able to use our model without it causing huge problems, though the model escaping to the outside world isn’t as big of a problem anymore (it still might be better to reduce further proliferation).
- We coordinate with China to slow down and to take sufficient precautions that the model doesn’t escape from either of us. This allows for paying the control tax, at least for a while. We don’t necessarily need China to do a good job with control beyond preventing their copy of the model from escaping; just delaying the point at which it escapes is also pretty helpful.
- During the time we have, we use huge amounts of AI labor to figure out a cheaper and more robust alignment strategy, or to build the case for a stronger level of coordination.
Edit: One further note is that if the tax from control is sufficiently small and takeoff is sufficiently slow by default, then control can still let us get the key work done in time (before adversaries are able to train much more powerful models such that we can no longer be competitive).
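As a toy way to see when this goes through (my own simplification, with made-up symbols): suppose the key safety work takes time $T$ without control, the control tax multiplies that by $(1+\tau)$, and adversaries can train much more powerful models after time $D$. Then control still gets the work done in time whenever

$$T(1+\tau) \le D \quad\Longleftrightarrow\quad \tau \le \frac{D}{T} - 1,$$

so a small tax and a slow default takeoff (large $D$ relative to $T$) are exactly the regime where this works.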
You also bring up a risk compensation argument: maybe control is bad because it encourages the development of powerful models by making some of the risks lower. This argument also applies to other safety strategies.
Zach also noted this, but I thought it was worth emphasizing.
Thanks for the thoughts. They’ve made me think that I’m likely underestimating how much Control is needed to get useful work out of AIs that are capable of and inclined toward scheming. Ideally, this fact would increase the likelihood of other actors implementing AI Control schemes with the stolen model that are at least sufficient for containment and/or make them less likely to steal the model, though I wouldn’t want to put too much weight on this hope.
>This argument isn’t control-specific; it applies to any safety scheme with some operating tax or implementation difficulty.[1][2]
Yep, for sure. I’ve changed the title and commented about this at the end.
>Ideally, this fact would increase the likelihood of other actors implementing AI Control schemes with the stolen model that are at least sufficient for containment and/or make them less likely to steal the model, though I wouldn’t want to put too much weight on this hope.
Agreed; it might not be clear to all actors that AI control would be in their interest for reducing risks of sabotage and other bad outcomes!