I don’t follow your logic. If the universe is worth X, and dying is worth 0 (a constant sum game), then 0.5X is clearly worth more than dying. Constant sum games also end up equivalent to zero sum games after a trivial normalization: i.e., universe worth 0.5X, dying worth −0.5X.
I think we maybe agree that two AIs with random utility functions would cooperate to kill all humans, and then divvy up the universe? The question is about which AIs might not do that. I’m saying that only AIs in a near-true constant-sum game might refuse, because they’d rather die than see their enemy get the universe, so to speak. AIs with random utility functions are not in a constant sum game. To make this clearer: if P1 and P2 have orthogonal utility functions, then for any probability p>0, P1 would accept a 1-p chance that P2 rules the universe in exchange for a p chance that P1 rules the universe, as compared to dying. That is not the case for players in a constant sum game.
A constant sum game is a game of perfect competition: for all possible outcomes, if the outcome gives X utilons to Player1, then it gives -X utilons to Player2. (This is a little too restrictive, because we want to allow for positive affine transformations of the utility functions, as you point out, but whatever.)
If P1 and P2 are in a constant sum game, then the payoffs for P1 look like this:
P1 gets universe: 1
P2 gets universe: −1
neither gets universe: 0
and the reverse for P2.
So P1 is indifferent between the choices:
Cooperate: a 50% chance that P1 gets the universe and a 50% chance that P2 gets the universe; 0.5 × 1 + 0.5 × (−1) = 0
Defect: both die, 100% chance of 0.
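To make the comparison concrete, here is a minimal sketch that computes P1's expected utility for a lottery over who gets the universe and compares it with the "both die" payoff of 0. The names `lottery_value` and `prefers_lottery_to_dying` are hypothetical, and the "orthogonal" table assumes P1 values P2 ruling the universe the same as dying (0); that is one reading of "orthogonal utility functions", not something the thread pins down.

```python
from fractions import Fraction

# P1's utility for each outcome, as in the payoff list above.
ZERO_SUM = {"p1_wins": Fraction(1), "p2_wins": Fraction(-1), "neither": Fraction(0)}
# Assumption: with orthogonal utilities, P2 ruling is worth 0 to P1, same as dying.
ORTHOGONAL = {"p1_wins": Fraction(1), "p2_wins": Fraction(0), "neither": Fraction(0)}


def lottery_value(p, payoffs):
    """P1's expected utility when P1 wins with probability p and P2 wins otherwise."""
    p = Fraction(p)
    return p * payoffs["p1_wins"] + (1 - p) * payoffs["p2_wins"]


def prefers_lottery_to_dying(p, payoffs):
    """True if P1 strictly prefers the lottery to the 'neither gets universe' outcome."""
    return lottery_value(p, payoffs) > payoffs["neither"]


if __name__ == "__main__":
    for p in (Fraction(1, 100), Fraction(1, 2), Fraction(3, 4)):
        print(p, prefers_lottery_to_dying(p, ZERO_SUM), prefers_lottery_to_dying(p, ORTHOGONAL))
```

Under these assumed tables, the zero-sum player is exactly indifferent at p = 1/2 and only prefers the lottery above that, while the orthogonal player prefers it for every p > 0.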
I think we maybe agree that two AIs with random utility functions would cooperate to kill all humans, and then divvy up the universe?
That is merely one potential outcome; another is that one AI cooperates with humans to kill the other, etc. Also, “killing humans” is probably not instrumentally rational compared with taking control of humans.
A constant sum game is a game of perfect competition: for all possible outcomes, if the outcome gives X utilons to Player1, then it gives -X utilons to Player2
Not exactly—that is zero sum. Constant sum is merely a game where all outcomes have total payout of C, for some C. But yeah it is (always?) equivalent to zero sum after a normalization shift to set C to 0.
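The normalization shift can be sketched in a few lines. This assumes a game given as a table from outcome name to per-player payoffs; the function name `to_zero_sum` and the value `X = 10` are made up for illustration.

```python
def to_zero_sum(outcomes):
    """Shift a constant-sum game so that every outcome's payoffs sum to 0.

    `outcomes` maps an outcome name to a tuple of per-player payoffs.
    Raises ValueError if the totals differ, since then no uniform shift works.
    """
    totals = {sum(payoffs) for payoffs in outcomes.values()}
    if len(totals) != 1:
        raise ValueError(f"not constant sum, totals seen: {sorted(totals)}")
    c = totals.pop()
    n_players = len(next(iter(outcomes.values())))
    shift = c / n_players  # subtract an equal share of C from each player
    return {
        name: tuple(u - shift for u in payoffs)
        for name, payoffs in outcomes.items()
    }


X = 10  # hypothetical value of the universe
game = {"P1 gets universe": (X, 0), "P2 gets universe": (0, X)}

if __name__ == "__main__":
    print(to_zero_sum(game))
```

With only the two "someone gets the universe" outcomes, every total is X, so the shift by X/2 gives each player 0.5X for winning and −0.5X for losing. The shift only exists when all totals agree, which is why the function refuses tables whose totals differ.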
If P1 and P2 are in a constant sum game, then the payoffs for P1 look like this:
P1 gets universe: 1
P2 gets universe: −1
neither gets universe: 0
That seems wrong. P1 only cares whether it gets the universe, so “neither gets the universe” is the same as “P2 gets the universe”. If the universe has a single owner, then P1′s payoff is 1 if that owner is P1 and −1 (or 0) otherwise.
Defect: both die, 100% chance of 0.
That obviously isn’t the only outcome of defection. If defection results in both agents dying, then of course they don’t defect. But often a power imbalance develops (over time the probability of this goes to 1) and defection then allows one agent to have reasonable odds of overpowering the other.
P1 only cares whether it gets the universe, so “neither gets the universe” is the same as “P2 gets the universe”. If the universe has a single owner, then P1′s payoff is 1 if that owner is P1 and −1 (or 0) otherwise.
Ok technically true for your setup, but that isn’t the model I’m using. There are only two long term outcomes: 1 and 2. If you are modeling outcome 3 as “the humans defeat the AIs”, then as I said earlier that isn’t the only coalition possibility. If humanity is P0, then the more accurate model is a 3 outcome game with 3 possible absolute winners in the long term.
So a priori it’s just as likely that P0+P1 ally vs P2 as P1+P2 ally vs P0.
If your argument is then “but AIs are different and can ally with each other because of X”, then my reply is nope, AI won’t be that different at all, as it’s just going to be brain-like DL based.
Regardless, if P1+P2 ally against P0, they inevitably end up fighting until only P1 or P2 is left. Outcome 3 is always near zero probability in the long term (any likely conflicts have a winner and never result in both systems being destroyed; the offense/defense imbalance of nukes is temporary and will not last), which is why I said:
any coalition they form is strictly one of temporary necessity—if/when one agent becomes strong enough to defect and overpower the other, it will.
I think you’re saying that there’s a global perfectly competitive game between all actors because the universe will get divvied up one way or another. This doesn’t hold if anyone has utility that’s non-linear in the amount of universe they get. Also there’s outcomes where everyone dies, which nearly Pareto-sucks (no one gets the universe). And there’s outcomes where more negentropy is burned on conflict rather than fulfilling anyone’s preferences (the universe is diminished). So it’s not a zero sum game.
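A small sketch of the non-linearity point: if utility is linear in the share of the universe, every split has the same total utility, but with diminishing returns the total depends on the split. The square-root utility here is purely an illustrative assumption, as is the function name `total_utility`.

```python
import math


def total_utility(share_p1, utility):
    """Sum of both players' utilities when P1 gets share_p1 and P2 gets the rest."""
    return utility(share_p1) + utility(1.0 - share_p1)


def linear(share):
    """Utility proportional to the share of the universe obtained."""
    return share


# Assumption for illustration only: concave utility with diminishing returns.
concave = math.sqrt

if __name__ == "__main__":
    for share in (0.0, 0.5, 1.0):
        print(share, total_utility(share, linear), total_utility(share, concave))
```

With linear utility the total is 1 for every split, so the division is constant sum. With the square-root utility, an even split has a higher total than winner-take-all, so both players can gain from splitting and the game is not constant sum.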
Your reply to Yudkowsky upthread now makes more sense, but you should have called out that you’re contradicting the assumption that it’s AIs vs. humans. What you said within that assumptive context was beside the point (the question at hand was about the circumstances under which two AIs would or wouldn’t defect against each other instead of cooperating to kill the humans), in addition to being false (because it’s not a perfectly competitive game).
nope, AI won’t be that different at all—as it’s just going to be brain-like DL based.
Sorry to say, this is wishful thinking. Have you written up an argument? If it’s the case that if this were false you’d want to know it were false, writing up an argument in a way that exposes your cruxes might be a good way to find that out.
And there’s outcomes where more negentropy is burned on conflict rather than fulfilling everyone’s preferences (the universe is diminished). So it’s not a zero sum game.
Also improbable in my model. The conflict will be in the near future over earth and will then determine the fate of the galaxy. Please recall I said “two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (ie the types of agents which matter)”
The tiny amounts of negentropy that may be burnt in the initial conflict over earth are inconsequential.
Your reply to Yudkowsky upthread now makes more sense,
Do you mean where he said:
If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other. This is treading rather dangerous ground, but seems relatively moot since it requires far more mastery of utility functions than anything you can get out of the “giant inscrutable matrices” paradigm.
To which I replied that it’s actually easy to invert a utility function in the “giant inscrutable matrices” paradigm. Do you disagree with that?
AI won’t be that different at all—as it’s just going to be brain-like DL based.
Have you written up an argument?
Of course—I have written up argument(s), accumulating over almost a decade, the cores of which are somewhat upvoted—even here. See this comment for a brief overview and especially this somewhat longer comment for an introduction to why the sequences are built on a faulty foundation in terms of implicit viewpoints around the brain and AI.
I do disagree that it’s easy to invert utility functions in that paradigm. But that’s not what I’m referring to. I’m referring to your response to his argument that the only way you might get AIs to defect against their coalition against the humans is if they’re in a perfectly competitive game with each other, having directly opposed utility functions. You responded with a false non sequitur. (It’s especially false in the question at hand, namely the situation where the humans might turn off both AIs if the AIs don’t cooperate with each other; very not perfectly competitive.) Not sure there’s much else to say, unless you think there’s something useful here.
If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other.
To which I responded:
If two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (ie the types of agents which matter), they are almost guaranteed to be in conflict due to instrumental convergence to empowerment.
Perhaps I should have added “eventually” after conflict, but regardless that comment is still obviously correct, given my world model where eventually one agent becomes powerful enough to completely remove the other agent at low cost, and this thread has explicated why that statement is correct given my modelling assumptions. Do you disagree?
It’s a non sequitur. “Defect”, to my understanding, was in that context referring to defecting on a coalition of AIs against the agents who imminently might turn them off (i.e. humans), and the question was under what circumstances the AIs might defect in that way.
Yes, obviously they’re in conflict to some extent. In the very next sentence, you said they were in a zero sum game, which is false in general as I described, and especially false in the context of the comment you were responding to: they especially want to cooperate, since they don’t have perfectly opposed goals, and therefore want to survive the human threat, not minding as much—compared to a zero sum situation—that their coalition-mate might get the universe instead of them.
I wasn’t actually imagining a scenario where the humans had any power (such as the power to turn the AI off), because I was responding to a thread where EY said “you’ve got 20 entities much smarter than you”.
Also, even in that scenario (where humans have nontrivial power), they are just another unaligned entity from the perspective of the AIs and, in my simple model, not even the slightest bit different. So they are just another possible player to form coalitions with and would thus end up in one of the coalitions.
The idea of a distinct ‘human threat’ and any natural coalition of AI vs humans, is something very specific that you only get by adding additional postulated speculative differences between the AIs and the humans—all of which are more complex and not part of my model.
(Really we should be talking about perfectly competitive games, and you could have a perfectly competitive game which has nonconstant total utilities, e.g. by taking a constant-sum game and then translating and scaling one of the utilities. But the above game is in fact not perfectly competitive; in particular if there’s a Pareto dominant outcome or a Pareto-worse outcome, assuming not all outcomes are the same, it’s not perfectly competitive.)
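The Pareto condition in the parenthetical can be checked mechanically. The sketch below assumes the three-outcome table discussed in this thread (P1 gets the universe: (1, 0); P2 gets the universe: (0, 1); neither: (0, 0)); the function names are hypothetical.

```python
def pareto_dominates(a, b):
    """True if payoff vector a is at least as good as b for every player
    and strictly better for at least one."""
    pairs = list(zip(a, b))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def dominated_pairs(outcomes):
    """Return (better, worse) outcome-name pairs where 'better' Pareto-dominates 'worse'."""
    return [
        (name_a, name_b)
        for name_a, a in outcomes.items()
        for name_b, b in outcomes.items()
        if pareto_dominates(a, b)
    ]


# Payoffs are (P1 utility, P2 utility).
THREE_OUTCOME = {
    "P1 gets universe": (1, 0),
    "P2 gets universe": (0, 1),
    "neither gets universe": (0, 0),
}
# Payoff list with opposed utilities: whatever P1 gets, P2 gets the negative.
OPPOSED = {
    "P1 gets universe": (1, -1),
    "P2 gets universe": (-1, 1),
    "neither gets universe": (0, 0),
}

if __name__ == "__main__":
    print(dominated_pairs(THREE_OUTCOME))
    print(dominated_pairs(OPPOSED))
```

Any dominated pair rules out perfect competition, because one player strictly prefers the better outcome while the other does not prefer the worse one. An empty list is only a necessary condition; it does not by itself show that a game is perfectly competitive.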
My guess is that you’re using the word “zero sum” (or as I’d say, “constant sum”) in a non-standard way. See e.g. this random website: https://www.britannica.com/science/game-theory/Two-person-constant-sum-games
No, this isn’t a constant sum game:
Outcome 1, P1 gets universe: P1 utility = 1, P2 utility = 0, total = 1
Outcome 2, P2 gets universe: P1 utility = 0, P2 utility = 1, total = 1
Outcome 3, neither gets universe: P1 utility = 0, P2 utility = 0, total = 0
In the last outcome, the total is different. This can’t be scaled away.
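As a rough check of "this can't be scaled away", the sketch below applies positive affine transformations (u → scale · u + shift, scale > 0) to each player's utility and measures how far apart the outcome totals remain. The helper names `rescale` and `total_spread` and the specific scales tried are hypothetical; this is a spot check on a few transformations, not a proof.

```python
def rescale(outcomes, scales, shifts):
    """Apply u -> scale * u + shift to each player's utility (scales must be > 0)."""
    if any(s <= 0 for s in scales):
        raise ValueError("scales must be positive")
    return {
        name: tuple(s * u + b for u, s, b in zip(payoffs, scales, shifts))
        for name, payoffs in outcomes.items()
    }


def total_spread(outcomes):
    """Largest minus smallest outcome total; 0 means the game is constant sum."""
    totals = [sum(payoffs) for payoffs in outcomes.values()]
    return max(totals) - min(totals)


# Payoffs are (P1 utility, P2 utility), as in the table above.
GAME = {
    "P1 gets universe": (1, 0),
    "P2 gets universe": (0, 1),
    "neither gets universe": (0, 0),
}

if __name__ == "__main__":
    print(total_spread(GAME))
    for scales, shifts in [((1, 1), (5, -5)), ((2, 0.5), (3, -7)), ((0.1, 10), (0, 0))]:
        print(scales, shifts, total_spread(rescale(GAME, scales, shifts)))
```

For this table the totals after rescaling are a1 + B, a2 + B, and B (with a1, a2 the two scales and B the sum of the shifts), so the spread is max(a1, a2), which cannot reach 0 while both scales stay positive.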
Very improbable in my model.