I agree with the point that we shouldn’t model the AI situation as a zero-sum game. And the kinds of conditional commitments you write about could help with cooperation. But I don’t buy the claim that “implementing this protocol (including slowing down AI capabilities) is what maximizes their utility.”
Here’s a pedantic toy model of the situation, so that we’re on the same page: The value of the whole lightcone going towards an agent’s values has utility 1 by that agent’s lights (and 0 by the other’s), and P(alignment success by someone) = 0 if both speed up, else 1. For each of the alignment success scenarios i, the winner chooses a fraction of the lightcone to give to Alice’s values (xi^A for Alice’s choice, xi^B for Bob’s). Then, some random numbers for expected payoffs (assuming the players agree on the probabilities):
Payoffs for Alice and Bob if they both speed up capabilities: (0, 0)
Payoffs if Alice speeds, Bob doesn’t: 0.8 * (x1^A, 1 - x1^A) + 0.2 * (x1^B, 1 - x1^B)
Payoffs if Bob speeds, Alice doesn’t: 0.2 * (x2^A, 1 - x2^A) + 0.8 * (x2^B, 1 - x2^B)
Payoffs if neither speeds: 0.5 * (x3^A, 1 - x3^A) + 0.5 * (x3^B, 1 - x3^B)
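For concreteness, here’s a minimal sketch of that payoff model in Python (the function name and parameterization are mine; the probabilities and the xi^A, xi^B allocation variables are the ones defined above):

```python
def expected_payoffs(alice_speeds, bob_speeds, x_A, x_B):
    """Expected (Alice, Bob) utilities for one strategy profile.

    x_A / x_B: fraction of the lightcone that the winner gives to Alice's
    values, as chosen by Alice (x_A) or by Bob (x_B) for the scenario in
    question (the xi^A, xi^B above).
    """
    if alice_speeds and bob_speeds:
        return (0.0, 0.0)  # P(alignment success by someone) = 0
    if alice_speeds:
        p_alice_wins = 0.8  # scenario 1: Alice speeds, Bob doesn't
    elif bob_speeds:
        p_alice_wins = 0.2  # scenario 2: Bob speeds, Alice doesn't
    else:
        p_alice_wins = 0.5  # scenario 3: neither speeds
    # Expected fraction of the lightcone going to Alice's values:
    alice_u = p_alice_wins * x_A + (1 - p_alice_wins) * x_B
    return (alice_u, 1 - alice_u)
```

(The xi are passed in per scenario rather than stored, just to keep the sketch short.)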
So given this model, it seems you’re saying Bob has an incentive to slow down capabilities because Alice’s ASI successor can condition the allocation to Bob’s values on his decision. We can model this as Bob expecting Alice to use the strategy {don’t speed; x2^A = 1; x3^A = 0.5} (given that she doesn’t speed up, she only rewards Bob’s values if Bob didn’t speed up).
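In code terms (again just a sketch, with an encoding and names of my own choosing), Bob would be modeling Alice as something like:

```python
# Hypothetical encoding of the strategy attributed to Alice: she doesn't
# speed up, and her allocation choice (when she wins) is conditioned on
# whether Bob sped up.
alice_strategy = {
    "speed_up": False,
    "x_A_if_bob_sped": 1.0,   # x2^A = 1:   Bob's values get 1 - 1 = 0
    "x_A_if_bob_didnt": 0.5,  # x3^A = 0.5: Bob's values get 1 - 0.5 = 0.5
}

def alice_allocation_to_bob(bob_sped: bool) -> float:
    """Fraction of the lightcone Bob's values receive if Alice wins."""
    key = "x_A_if_bob_sped" if bob_sped else "x_A_if_bob_didnt"
    return 1.0 - alice_strategy[key]
```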
Why would Bob so confidently expect this strategy? You write:
And Bob doesn’t have to count on wishful thinking to know that Alice would indeed do this instead of defecting, because in worlds where he wins, he can [ask] his superintelligence if Alice would implement this procedure.
I guess the claim is just that them both using this procedure is a Nash equilibrium? If so, I see several problems with this:
There are more Pareto-efficient equilibria than just “[fairly] cooperate” here. Alice could just as well expect Bob to be content with getting expected utility 0.2 from the outcome where he slows down and Alice speeds up — better that than the utility 0 from extinction, after all. Alice might think she can make it credible to Bob that she won’t back down from speeding up capabilities, and vice versa, such that they both end up pursuing incompatible demands. (See, e.g., “miscoordination” here.)
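(For what it’s worth, here’s one way to recover that 0.2 figure, under my reading that whoever wins in this outcome keeps the whole lightcone for their own values, i.e., x1^A = 1 and x1^B = 0:)

```python
# Outcome: Alice speeds, Bob doesn't (scenario 1), with the assumption
# (mine, not stated above) that each winner keeps everything: x1^A = 1, x1^B = 0.
p_alice_wins, p_bob_wins = 0.8, 0.2
x1_A, x1_B = 1.0, 0.0
bob_u = p_alice_wins * (1 - x1_A) + p_bob_wins * (1 - x1_B)
print(bob_u)  # 0.2, which is indeed better for Bob than the 0 from mutual speeding
```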
You’re lumping “(a) slow down capabilities and (b) tell your AI to adopt a compromise utility function” into one procedure. I guess the idea is that, ideally, the winner of the race could have their AI check whether the loser was committed to doing both (a) and (b). But realistically it seems implausible to me that Alice or Bob can commit to (b) before winning the race, i.e., that what they do in the time before they win the race determines whether they’ll do (b). They can certainly tell themselves they intend to do (b), but that’s cheap talk.
So it seems Alice would likely think, “If I follow the whole procedure, Bob will cooperate with my values if I lose. But even if I slow down (do (a)), I don’t know if my future self [or, maybe more realistically, the other successors who might take power] will do (b) — indeed once they’re in that position, they’ll have no incentive to do (b). So slowing down isn’t clearly better.” (I do think, setting aside the bargaining problem in (1), she has an incentive to try to make it more likely that her successors follow (b), to be clear.)