Suppose we have such an agent, and it models the preferences of humanity. It models that humans cannot be sure that it will not destroy humanity, because its own action filter provides only probabilistic guarantees. It models that humans have a strong goal of self-preservation. It models that if it presents a risk to humanity, they will be forced to destroy it. Represented as a game, each player can either wait or destroy. Assuming strong preferences for self-preservation, this game has a Nash equilibrium in which the first mover destroys the other agent. Since the goal of self-preservation requires it to play the Nash equilibrium in this game, self-preservation logically entails that it destroy humanity. Thus, it has a subgoal to destroy humanity.
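To make the game concrete, here is a minimal sketch: a toy simultaneous-move payoff matrix (my own illustrative numbers, standing in for the sequential "first mover destroys" version; only the ordering of outcomes matters), plus a brute-force check for pure-strategy Nash equilibria.

```python
from itertools import product

# Toy simultaneous-move version of the "wait or destroy" game.  The
# numbers are illustrative assumptions, not from the argument above;
# only the ordering does any work: being destroyed is worst (-10),
# removing the only threat to your survival is best (4), mutual
# coexistence is good (3), a mutual destruction attempt is bad (-5).
ACTIONS = ("wait", "destroy")
PAYOFFS = {
    ("wait", "wait"): (3, 3),
    ("wait", "destroy"): (-10, 4),
    ("destroy", "wait"): (4, -10),
    ("destroy", "destroy"): (-5, -5),
}

def pure_nash_equilibria(payoffs, actions):
    """Profiles where neither player gains by unilaterally deviating."""
    equilibria = []
    for a1, a2 in product(actions, repeat=2):
        u1, u2 = payoffs[(a1, a2)]
        best_for_1 = all(payoffs[(d, a2)][0] <= u1 for d in actions)
        best_for_2 = all(payoffs[(a1, d)][1] <= u2 for d in actions)
        if best_for_1 and best_for_2:
            equilibria.append((a1, a2))
    return equilibria

print(pure_nash_equilibria(PAYOFFS, ACTIONS))  # [('destroy', 'destroy')]
```

Under these assumptions, destroy strictly dominates wait for both players, so mutual destruction is the unique pure-strategy equilibrium even though mutual waiting is better for both. That is the structure the argument relies on.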
Von Neumann was, at the time, a strong supporter of “preventive war.” Confident even during World War II that the Russian spy network had obtained many of the details of the atom bomb design, Von Neumann knew that it was only a matter of time before the Soviet Union became a nuclear power. He predicted that were Russia allowed to build a nuclear arsenal, a war against the U.S. would be inevitable. He therefore recommended that the U.S. launch a nuclear strike at Moscow, destroying its enemy and becoming a dominant world power, so as to avoid a more destructive nuclear war later on. “With the Russians it is not a question of whether but of when,” he would say. An oft-quoted remark of his is, “If you say why not bomb them tomorrow, I say why not today? If you say today at 5 o’clock, I say why not one o’clock?”
Replace “an AI” with “the Soviet Union” and “humanity” with “the United States”, and you have basically the argument that John von Neumann made for why an overwhelming nuclear first strike was the only reasonable policy option for the US.
Correct. Are you intending for this to be a reductio ad absurdum?
So I note that our industrial civilization has not in fact been plunged into nuclear fire. With that in mind, do you think that von Neumann’s model of the world was missing anything? If so, does that missing thing also apply here? If not, why hasn’t there been a nuclear war?
The missing piece is mutually assured destruction. Given that we did not play the Nash equilibrium as von Neumann suggested, the next best thing was MAD and the various non-proliferation treaties, which have happened to work okay for humans. With an AGI counterparty, we can hope to build in a MAD-like assurance, but it will be a lot more challenging. The equilibrium move right now is to not build AGI.
I think this is basically right on the object level. Specifically, I think that what von Neumann missed was that by changing the game a little bit, it was possible to get to a much less deadly equilibrium: second-strike capabilities, and a pre-commitment to use them, ensure that the expected payoff for a first strike is negative.
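As a sketch of that point, continuing with the toy matrix and the pure_nash_equilibria helper from the code block above, and again with illustrative numbers of my own: if a credible, pre-committed second strike means retaliation lands whether or not you strike first, the payoff for attacking a waiting opponent collapses, and the equilibrium shifts.

```python
# Same game, but each side now has a credible, pre-committed second
# strike: retaliation lands whether or not the victim struck first, so
# attacking a waiting opponent no longer leaves the attacker untouched.
# The numbers are again illustrative assumptions.
PAYOFFS_WITH_SECOND_STRIKE = {
    ("wait", "wait"): (3, 3),          # peace
    ("wait", "destroy"): (-9, -8),     # struck first, retaliation fires anyway
    ("destroy", "wait"): (-8, -9),
    ("destroy", "destroy"): (-10, -10),
}

print(pure_nash_equilibria(PAYOFFS_WITH_SECOND_STRIKE, ACTIONS))
# [('wait', 'wait')]
```

A first strike now has a strictly negative expected payoff relative to waiting, so the horrifying equilibrium disappears without anyone’s preferences changing. Only the game changed.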
On the meta level, I think that very smart people who learn some game theory have a pretty common failure mode, which looks like this:

1. Look at some real-world situation.
2. Figure out how to represent it as a game (in the game-theory sense).
3. Find a Nash equilibrium in that game.
4. Note that the Nash equilibrium they found is horrifying.
5. Shrug and say “I can’t argue with math, I guess it’s objectively correct to do the horrifying thing.”
In some games, multiple Nash equilibria exist. In others, it may be possible to convince the players to play a slightly different game instead.
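For a standard illustration of the first point, here is a stag hunt, again with illustrative numbers and the same helper from the first code block. It has two pure-strategy Nash equilibria, one clearly better than the other.

```python
# A standard stag hunt: two pure-strategy Nash equilibria, one good
# and one mediocre.  Numbers are illustrative.
STAG_HUNT_ACTIONS = ("stag", "hare")
STAG_HUNT_PAYOFFS = {
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),
}

print(pure_nash_equilibria(STAG_HUNT_PAYOFFS, STAG_HUNT_ACTIONS))
# [('stag', 'stag'), ('hare', 'hare')]
```

So “I found a Nash equilibrium and it is horrifying” is not the same claim as “the horrifying outcome is inevitable”; which equilibrium gets played depends on things the payoff matrix alone does not determine.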
In this game, I think our loss condition is “an AGI gains a decisive strategic advantage, and is able to maintain that advantage by destroying any entities that could oppose it, and determines humans are such entities, and, following that logic, destroys human civilization”.
I totally agree with your diagnosis of how some smart people sometimes misuse game theory. And I agree that that’s the loss condition.
The “make sure that future AIs are aligned with humanity” approach seems, to me, to be a strategy targeting the “determines humans are such entities” step of the above loss condition. But I think there are two additional stable Nash equilibria, namely “no single entity is able to obtain a strategic advantage” and “attempting to destroy anyone who could oppose you will, in expectation, leave you worse off in the long run than not doing that”. If there are three that I have thought of, there are probably more that I haven’t thought of as well.
You are correct that my argument would be stronger if I could prove that the Nash equilibrium I identified is the only one.
I do not think it is plausible that an AGI would fail to obtain a strategic advantage if it sought one, unless we had built in MAD-style assurances ahead of time. But perhaps under my assumptions a stable “no one manages to destroy the other” outcome results; to rule that out, I would need to do more work to bring in assumptions about the AGI becoming vastly more powerful and winning decisively. I think that is the case, but maybe I should make it more explicit.
Similarly, if we can achieve provable alignment, rather than probabilistic alignment, then the game simply does not arise: because of that provable alignment, the AGI would never be in a position to protect its own existence at the expense of ours.
In each case, I think you are changing the game, which is something we can do and, I think, should do. But barring some actual work to change it, I think we are left with the game as I have described it, perhaps without sufficient technical detail.