Asya Bergal: It seems like people believe there’s going to be some kind of pressure for performance or competitiveness that pushes people to try to make more powerful AI in spite of safety failures. Does that seem untrue to you or like you’re unsure about it?
Rohin Shah: It seems somewhat untrue to me. I recently made a comment about this on the Alignment Forum. People make this analogy between AI x-risk and the risk of nuclear war, based on mutually assured destruction. That particular analogy seems off to me, because with nuclear war you need the threat of being able to hurt the other side, whereas with AI x-risk, if the destruction happens, that affects you too. So there’s no mutually assured destruction type dynamic.
I find this statement very confusing. I wonder if I’m misinterpreting Rohin. Wikipedia says “Mutual(ly) assured destruction (MAD) is a doctrine of military strategy and national security policy in which a full-scale use of nuclear weapons by two or more opposing sides would cause the complete annihilation of both the attacker and the defender (see pre-emptive nuclear strike and second strike).”
A core part of the idea of MAD is that the destruction would be mutual. So “with AI x-risk, if the destruction happens, that affects you too” seems like a reason why MAD is a good analogy, and why the way we engaged in MAD might suggest people would engage in similar brinkmanship or risks with AI x-risk, even if the potential for harm to people’s “own side” would be extreme. There are other reasons why the analogy is imperfect, but the particular feature Rohin mentions seems like a reason why an analogy could be drawn.
MAD-style strategies happen when:
1. There are two (or more) actors that are in competition with each other.
2. There is a technology such that if one actor deploys it and the other actor doesn’t, the first actor remains the same and the second actor is “destroyed”.
3. If both actors deploy the technology, then both actors are “destroyed”.
(I just made these up right now; you could probably get better versions from papers about MAD.)
Condition 2 doesn’t hold for accident risk from AI: if any actor deploys an unaligned AI, then both actors are destroyed.
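To make the role of condition 2 concrete, here’s a toy payoff sketch (the two-actor setup and all the numbers are made up purely to illustrate the structure, not a claim about actual payoffs):

```python
# Toy payoff matrices (numbers made up purely to illustrate the structure).
# Each entry maps (A's action, B's action) to (payoff to A, payoff to B).

# Nuclear-style case: condition 2 holds. A unilateral deployer is unharmed
# while its competitor is destroyed -- modelled here as a small relative
# gain (+1) for the deployer. If both deploy, both are destroyed (condition 3).
nukes = {
    ("deploy", "deploy"): (-100, -100),
    ("deploy", "dont"):   (   1, -100),
    ("dont",   "deploy"): (-100,    1),
    ("dont",   "dont"):   (   0,    0),
}

# Unaligned-AI accident case: condition 2 fails. Deploying a dangerous AI
# destroys everyone, including the deployer, whatever the other actor does.
unaligned_ai = {
    ("deploy", "deploy"): (-100, -100),
    ("deploy", "dont"):   (-100, -100),
    ("dont",   "deploy"): (-100, -100),
    ("dont",   "dont"):   (   0,    0),
}

def deploying_helps_a(game, b_action):
    """Is 'deploy' strictly better than 'dont' for actor A, given B's action?"""
    return game[("deploy", b_action)][0] > game[("dont", b_action)][0]

for name, game in [("nukes", nukes), ("unaligned AI", unaligned_ai)]:
    print(name, {b: deploying_helps_a(game, b) for b in ("deploy", "dont")})
# Prints:
#   nukes {'deploy': False, 'dont': True}         -> if your rival hasn't
#       deployed, deploying helps you: the pressure behind racing/deterrence.
#   unaligned AI {'deploy': False, 'dont': False} -> deploying never helps
#       you, so accident risk alone creates no such pressure.
```

The only structural difference between the two games is whether the unilateral-deployment cells satisfy condition 2, and that is exactly what creates (or removes) any payoff-based pressure to deploy.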
I agree I didn’t explain this well in the interview—when I said “if the destruction happens, that affects you too”, I should have said something like “if you deploy a dangerous AI system, that affects you too”, which is not true for nuclear weapons (deploying a nuke doesn’t affect you in and of itself).
I was already interpreting your comment as “if you deploy a dangerous AI system, that affects you too”. I guess I’m just not sure your condition 2 is actually a key ingredient for the MAD doctrine. From the name, the start of Wikipedia’s description, my prior impressions of MAD, and my general model of how it works, it seems like the key idea is that neither side wants to do the thing, because if they do the thing they get destroyed too.
The US doesn’t want to nuke Russia, because then Russia nukes the US. This seems like the same phenomenon as some AI lab not wanting to develop and release a misaligned superintelligence (or whatever), because then the misaligned superintelligence would destroy them too. So in the key way, the analogy seems to me to hold. Which would then suggest that, however cautious or incautious society was about nuclear weapons, this analogy alone (if we ignore all other evidence) suggests we may act similarly with AI. So it seems to me that there’s not an important disanalogy that should update us towards expecting safety (i.e., the history of MAD for nukes should only make us expect AI safety to the extent we think MAD for nukes was handled safely).
Condition 2 does seem important for the initial step of the US developing the first nuclear weapon, and other countries trying to do so, because it meant that the first country to get nuclear weapons would gain an advantage: it could use them without being destroyed itself, at that point. And that doesn’t apply to extreme AI accidents.
So would your argument instead be something like the following? “The initial development of nuclear weapons did not involve MAD. The first country who got them could use them without being itself harmed. However, the initial development of extremely unsafe, extremely powerful AI would substantially risk the destruction of its creator. So the fact we developed nuclear weapons in the first place may not serve as evidence that we’ll develop extremely unsafe, extremely powerful AI in the first place.”
If so, that’s an interesting argument, and at least at first glance it seems to me to hold up.
Condition 2 is necessary for race dynamics to arise, which is what people are usually worried about.
Suppose that AI systems weren’t going to be useful for anything—the only effect of AI systems was that they posed an x-risk to the world. Then it would still be true that “neither side wants to do the thing, because if they do the thing they get destroyed too”.
Nonetheless, I think in this world no one ever builds AI systems, and so we don’t need to worry about x-risk.
That seems reasonable to me. I think my point is that that’s a disanalogy between a potential “race” for transformative AI and the race/motivation for building the first nuclear weapons, rather than a disanalogy between the AI situation and MAD.
So it seems like this disanalogy is a reason to think that the evidence “we built nuclear weapons” is weaker evidence than one might otherwise think for the claim “we’ll build dangerous AI”, or the claim “we’ll build AI in an especially ‘racing’/risky way”. And that seems an important point.
But it seems like “MAD strategies have been used” remains whatever strength of evidence it previously was for the claim “we’ll do dangerous things with AI”. E.g., MAD strategies could still serve as some evidence for the general idea that countries/institutions are sometimes willing to do things that are risky to themselves, and that pose very large negative externalities of risks to others, for strategic reasons. And that general idea still seems to apply at least somewhat to AI.
(I’m not sure this is actually disagreeing with what you meant/believe.)
Suppose you have two events X and Y, such that X causes Y; that is, if not-X were true, then not-Y would also be true.
Now suppose there’s some Y’ analogous to Y, and you make the argument A: “since Y happened, Y’ is also likely to happen”. If that’s all you know, I agree that A is reasonable evidence that Y’ is likely to happen. But if you then show that the analogous X’ is not true, while X was true, I think argument A provides ~no evidence.
Example:
“It was raining yesterday, so it will probably rain today.”
“But it was cloudy yesterday, and today it is sunny.”
“Ah. In that case it probably won’t rain.”
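Putting made-up numbers on that example, just to show how the evidence from yesterday’s rain goes away once you condition on the cause (clouds) being absent:

```python
# Made-up numbers for the rain example. Cloud cover is the common cause:
# rain yesterday is evidence about rain today only via what it suggests
# about cloud cover.
p_cloudy = 0.5               # prior probability that a given day is cloudy
p_rain_given_cloudy = 0.6
p_rain_given_sunny = 0.05
p_same_weather = 0.8         # chance today's cloudiness matches yesterday's

# Baseline P(rain) on a random day.
p_rain = p_cloudy * p_rain_given_cloudy + (1 - p_cloudy) * p_rain_given_sunny

# P(cloudy yesterday | rain yesterday), via Bayes' rule.
p_cloudy_yest = p_cloudy * p_rain_given_cloudy / p_rain

# P(cloudy today | rain yesterday), propagating through weather persistence.
p_cloudy_today = (p_cloudy_yest * p_same_weather
                  + (1 - p_cloudy_yest) * (1 - p_same_weather))

# P(rain today | rain yesterday): the evidence flows only through cloud cover.
p_rain_today_given_rain_yest = (p_cloudy_today * p_rain_given_cloudy
                                + (1 - p_cloudy_today) * p_rain_given_sunny)

print(round(p_rain, 3))                        # 0.325 -- baseline chance of rain
print(round(p_rain_today_given_rain_yest, 3))  # 0.465 -- higher, because rain
                                               #   yesterday suggests clouds today
print(p_rain_given_sunny)                      # 0.05  -- but if you can see it's
                                               #   sunny, yesterday's rain adds nothing
```

That’s the sense in which, once you know the analogous X’ is false (it’s sunny today), the argument from Y to Y’ provides ~no evidence.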
I think condition 2 causes racing, which causes MAD strategies, in the case of nuclear weapons; since condition 2 / racing doesn’t hold in the case of AI, the fact that MAD strategies were used for nuclear weapons provides very little evidence about whether similar strategies will be used for AI.
On your sentence “MAD strategies could still serve as some evidence for the general idea that countries/institutions are sometimes willing to do things that are risky to themselves, and that pose very large negative externalities of risks to others, for strategic reasons”: I agree with it interpreted literally. But I think you can change “for strategic reasons” to “in cases where condition 2 holds” and still capture most of the cases in which this happens.
I think I get what you’re saying. Is it roughly the following?
“If an AI race did occur, maybe issues similar to what we saw with MAD would occur; there may well be an analogy there. But there’s a disanalogy between the nuclear weapons case and the AI risk case with regards to the initial race, such that the initial nuclear race provides little/no evidence that a similar AI race will occur. And if a similar AI race doesn’t occur, then the conditions under which MAD-style strategies might arise would not occur. So it might not really matter whether there’s an analogy between the MAD situation and the AI risk situation as it would be if a race occurred.”
If so, I think that makes sense to me, and it seems an interesting/important argument. Though it seems to suggest something more like “We may be more ok than people might think, as long as we avoid an AI race, and we’ll probably avoid an AI race”, rather than simply “We may be more ok than people might think”. And that distinction might e.g. suggest additional value to strategy/policy/governance work to avoid race dynamics, or to investigate how likely they are. (I don’t think this is disagreeing with you, just highlighting a particular thing a bit more.)
Interesting interview, thanks for sharing it!
Yup, I agree with that summary.