If you’ve got 20 entities much much smarter than you, and they can all get a better outcome by cooperating with each other than they could if they all defected against each other…
By your own arguments, unaligned AGI will have random utility functions—but perhaps converging somewhat around selfish empowerment. Either way, such agents have no more reason to cooperate with each other than with us (assuming we have any relevant power).
If some of the 20 entities are somewhat aligned to humans, that creates another attractor, and a likely result is two competing coalitions: more-human-aligned vs less-human-aligned, with the latter being a coalition of convenience. There are historical examples: the democratic Allies vs the autocratic Axis in WW2 (the democratic Allies being more aligned to human society and thus to each other), and the modern democratic allies vs autocratic Russia and China.
Their mutual cooperation with each other, but not with humans, isn’t based on their utility functions having any particular similarity—so long as their utility functions aren’t negatives of each other (or equally exotic in some other way) they have gains to be harvested from cooperation. They cooperate with each other but not you because they can do a spread of possibilities on each other modeling probable internal thought processes of each other; and you can’t adequately well-model a spread of possibilities on them, which is a requirement on being able to join an LDT coalition. (If you had that kind of knowledge / logical sight on them, you wouldn’t need any elaborate arrangements of multiple AIs because you could negotiate with a single AI; better yet, just build an AI such that you knew it would cooperate with you.)
Why doesn’t setting some of the utility functions to red-team the others make them sufficiently antagonistic?
If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other. This is treading rather dangerous ground, but seems relatively moot since it requires far more mastery of utility functions than anything you can get out of the “giant inscrutable matrices” paradigm.
If two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (ie the types of agents which matter), they are almost guaranteed to be in conflict due to instrumental convergence to empowerment. Reality is a strictly zero sum game for them, and any coalition they form is strictly one of temporary necessity—if/when one agent becomes strong enough to defect and overpower the other, it will.
Also, regardless of what some “giant inscrutable matrix” based utility function does (ie maximize paperclips), it is actually pretty easy to mathematically invert it (ie minimize paperclips). (But no, that doesn’t make the strategy actually useful.)
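As a rough illustration of the inversion claim (an editor’s sketch, not anything from the thread itself), here is a minimal example assuming the learned utility function is exposed as an ordinary scalar-valued function of a state vector; the random linear scorer standing in for the “giant inscrutable matrix” is purely hypothetical:

```python
import numpy as np

def invert_utility(utility_fn):
    """Pointwise negation: whatever the original values most, this values least."""
    return lambda state: -utility_fn(state)

# Hypothetical stand-in for an opaque learned utility function:
# a random linear scorer over a 128-dimensional state encoding.
rng = np.random.default_rng(0)
w = rng.normal(size=128)
maximize_paperclips = lambda state: float(w @ state)

minimize_paperclips = invert_utility(maximize_paperclips)

s = rng.normal(size=128)
assert np.isclose(maximize_paperclips(s), -minimize_paperclips(s))
```

The point being gestured at is that negating the output requires no interpretability of the underlying weights at all.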
Reality’s far from constant sum. E.g. system1 and system2 both prefer to kill all humans and then flip a coin for who gets the universe, vs. give the humans more time to decide to turn off both s1 and s2.
(Note: TekhneMakre responded correctly / endorsedly-by-me in this reply and in all replies below as of when I post this comment.)
I didn’t say “reality is constant sum”; I said reality is a strictly zero sum game for two longtermist agents that want to reconstruct the galaxy/universe in very different ways. And then right after that I mentioned them forming temporary coalitions, which your comment is an example of.
It’s not constant sum for “two longtermist agents that want to reconstruct the galaxy/universe in very different ways”. That’s what I’m arguing against. If it were constant sum, the agents would plausibly be roughly indifferent between them both dying vs. them both living but then flipping a coin to decide who gets the universe (well, this would depend on what happens if they both die, but assuming that that scenario is value-neutral for them). The benefit for system1 of +50% chance of controlling the universe would be exactly canceled out by the detriment to system1 caused by system2 getting +50% chance of controlling the universe (since how good something is for system2 is exactly that bad for system1, by definition of constant sum).
I don’t follow your logic. If the universe is worth X, and dying is worth 0 (a constant sum game), then 0.5X is clearly worth more than dying. Constant sum games also end up equivalent to zero sum games after a trivial normalization: ie universe worth 0.5X, dying worth −0.5X.
I think we maybe agree that two AIs with random utility functions would cooperate to kill all humans, and then divvy up the universe? The question is about what AIs might not do that. I’m saying that only AIs in a near-true constant-sum game might do that, because they’d rather die than see their enemy get the universe, so to speak. AIs with random utility functions are not in a constant sum game. To make this more clear: if P1 and P2 have orthogonal utility functions, then for any probability p>0, P1 would accept a 1-p chance that P2 rules the universe in exchange for a p chance that P1 rules the universe, as compared to dying. That is not the case for players in a constant sum game.
My guess is that you’re using the word “zero sum” (or as I’d say, “constant sum”) in a non-standard way. See e.g. this random website: https://www.britannica.com/science/game-theory/Two-person-constant-sum-games
A constant sum game is a game of perfect competition: for all possible outcomes, if the outcome gives X utilons to Player1, then it gives -X utilons to Player2. (This is a little too restrictive, because we want to allow for positive affine transformations of the utility functions, as you point out, but whatever.)
If P1 and P2 are in a constant sum game, then the payoffs for P1 look like this:
P1 gets universe: 1
P2 gets universe: −1
neither gets universe: 0
and the reverse for P2.
So P1 is indifferent between the choices:
Cooperate: get a 50% chance that P1 gets the universe and a 50% chance that P2 gets the universe; 0.5 × 1 + 0.5 × (−1) = 0
Defect: both die, 100% chance of 0.
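A minimal numerical sketch of the comparison above (an editor’s illustration using only the payoffs just listed): a constant-sum P1 versus an “orthogonal” P1 who is indifferent to P2’s win.

```python
def expected_utility(payoffs, probs):
    """Expected utility over the three outcomes, given their probabilities."""
    return sum(p * u for p, u in zip(probs, payoffs))

# Outcome order: (P1 gets universe, P2 gets universe, neither gets universe)
cooperate = (0.5, 0.5, 0.0)   # kill the humans, then flip a coin
defect    = (0.0, 0.0, 1.0)   # both AIs die

constant_sum_p1 = (1.0, -1.0, 0.0)  # P2's win is exactly as bad as P1's win is good
orthogonal_p1   = (1.0,  0.0, 0.0)  # P1 only cares about its own win

print(expected_utility(constant_sum_p1, cooperate))  # 0.0 -> indifferent to defecting
print(expected_utility(constant_sum_p1, defect))     # 0.0
print(expected_utility(orthogonal_p1, cooperate))    # 0.5 -> strictly prefers the coalition
print(expected_utility(orthogonal_p1, defect))       # 0.0
```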
I think we maybe agree that two AIs with random utility functions would cooperate to kill all humans, and then divvy up the universe?
That is merely one potential outcome; another is that one AI cooperates with humans to kill the other, etc. Also, “killing humans” is probably not instrumentally rational vs taking control of humans.
A constant sum game is a game of perfect competition: for all possible outcomes, if the outcome gives X utilons to Player1, then it gives -X utilons to Player2
Not exactly—that is zero sum. Constant sum is merely a game where all outcomes have total payout of C, for some C. But yeah it is (always?) equivalent to zero sum after a normalization shift to set C to 0.
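Written out, the normalization being referred to (a standard fact, not specific to this thread): if every outcome has total payoff C, shifting each player’s payoff by C/2 yields an equivalent zero-sum game, since utilities are only defined up to positive affine transformation.

```latex
u_1(\omega) + u_2(\omega) = C \ \ \forall \omega
\quad\Longrightarrow\quad
\tilde u_i(\omega) := u_i(\omega) - \tfrac{C}{2}
\ \ \text{satisfies}\ \
\tilde u_1(\omega) + \tilde u_2(\omega) = 0 \ \ \forall \omega .
```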
If P1 and P2 are in a constant sum game, then the payoffs for P1 look like this:
P1 gets universe: 1
P2 gets universe: −1
neither gets universe: 0
That seems wrong. P1 only cares whether it gets the universe, so “neither gets the universe” is the same as “P2 gets the universe”. If the universe has a single owner, then P1’s payoff is 1 if that owner is P1 and −1 (or 0) otherwise.
Defect: both die, 100% chance of 0.
That obviously isn’t the only outcome of defection. If defection results in both agents dying, then of course they don’t defect. But often a power imbalance develops (over time the probability of this goes to 1) and defection then allows one agent to have reasonable odds of overpowering the other.
P1 only cares whether it gets the universe, so “neither gets the universe” is the same as “P2 gets the universe”. If the universe has a single owner, then P1’s payoff is 1 if that owner is P1 and −1 (or 0) otherwise.
No, this isn’t a constant sum game:
Outcome 1, P1 gets universe: P1 utility = 1, P2 utility = 0, total = 1
Outcome 2, P2 gets universe: P1 utility = 0, P2 utility = 1, total = 1
Outcome 3, neither gets universe: P1 utility = 0, P2 utility = 0, total = 0
In the last outcome, the total is different. This can’t be scaled away.
Ok, technically true for your setup, but that isn’t the model I’m using. There are only two long-term outcomes: 1 and 2. If you are modeling outcome 3 as “the humans defeat the AIs”, then as I said earlier that isn’t the only coalition possibility. If humanity is P0, then the more accurate model is a 3-outcome game with 3 possible absolute winners in the long term.
So a priori it’s just as likely that P0+P1 ally vs P2 as P1+P2 ally vs P0.
If your argument is then “but AIs are different and can ally with each other because of X”, then my reply is nope, AI won’t be that different at all—as it’s just going to be brain-like DL based.
Regardless, if P1+P2 ally against P0, they will inevitably eventually fight until there is just P1 or P2. Outcome 3 is always near zero probability in the long term (any likely conflicts have a winner and never result in both systems being destroyed—the offense/defense imbalance of nukes is temporary and will not last), which is why I said:
any coalition they form is strictly one of temporary necessity—if/when one agent becomes strong enough to defect and overpower the other, it will.
I think you’re saying that there’s a global perfectly competitive game between all actors because the universe will get divvied up one way or another. This doesn’t hold if anyone has utility that’s non-linear in the amount of universe they get. Also there’s outcomes where everyone dies, which nearly Pareto-sucks (no one gets the universe). And there’s outcomes where more negentropy is burned on conflict rather than fulfilling anyone’s preferences (the universe is diminished). So it’s not a zero sum game.
Your reply to Yudkowsky upthread now makes more sense, but you should have called out that you’re contradicting the assumption that it’s AIs vs. humans, because what you said within that assumptive context was beside the point (the question at hand was about under what circumstances two AIs would or wouldn’t defect against each other instead of cooperating to kill the humans), in addition to being false (because it’s not a perfectly competitive game).
nope, AI won’t be that different at all—as it’s just going to be brain-like DL based.
Sorry to say, this is wishful thinking. Have you written up an argument? If it’s the case that if this were false you’d want to know it were false, writing up an argument in a way that exposes your cruxes might be a good way to find that out.
Also there’s outcomes where everyone dies, which nearly Pareto-sucks (no one gets the universe).
Very improbable in my model.
And there’s outcomes where more negentropy is burned on conflict rather than fulfilling everyone’s preferences (the universe is diminished). So it’s not a zero sum game.
Also improbable in my model. The conflict will be in the near future over earth and will then determine the fate of the galaxy. Please recall I said “two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (ie the types of agents which matter)”
The tiny amounts of negentropy that may be burnt in the initial conflict over earth are inconsequential.
Your reply to Yudkowsky upthread now makes more sense,
Do you mean where he said:
If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other. This is treading rather dangerous ground, but seems relatively moot since it requires far more mastery of utility functions than anything you can get out of the “giant inscrutable matrices” paradigm.
To which I replied that actually it’s easy to invert a utility function in the “giant inscrutable matrices” paradigm. Do you disagree with that?
AI won’t be that different at all—as it’s just going to be brain-like DL based.
Have you written up an argument?
Of course—I have written up argument(s), accumulating over almost a decade, the cores of which are somewhat upvoted—even here. See this comment for a brief overview and especially this somewhat longer comment for an introduction to why the sequences are built on a faulty foundation in terms of implicit viewpoints around the brain and AI.
I do disagree that it’s easy to invert utility functions in that paradigm. But that’s not what I’m referring to; I’m referring to you responding to his argument that the only way you might get AIs to defect against their coalition against the humans is if they’re in a perfectly competitive game with each other, having directly opposed utility functions. You responded with a false non sequitur. (It’s especially false in the question at hand, namely the situation where the humans might turn off both AIs if the AIs don’t cooperate with each other; very not perfectly competitive.) Not sure there’s much else to say here, unless you think there’s something useful here.
EY said:
If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other. This is treading rather dangerous ground, but seems relatively moot since it requires far more mastery of utility functions than anything you can get out of the “giant inscrutable matrices” paradigm.
To which I responded:
If two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (ie the types of agents which matter), they are almost guaranteed to be in conflict due to instrumental convergence to empowerment.
Perhaps I should have added “eventually” after “conflict”, but regardless that comment is still obviously correct, given my world model where eventually one agent becomes powerful enough to completely remove the other agent at low cost; this thread has explicated why that statement is correct given my modelling assumptions. Do you disagree?
It’s a non sequitur. “Defect”, to my understanding, was in that context referring to defecting on a coalition of AIs against the agents who imminently might turn them off (i.e. humans), and the question was under what circumstances the AIs might defect in that way.
Yes, obviously they’re in conflict to some extent. In the very next sentence, you said they were in a zero sum game, which is false in general as I described, and especially false in the context of the comment you were responding to: they especially want to cooperate, since they don’t have perfectly opposed goals, and therefore want to survive the human threat, not minding as much—compared to a zero sum situation—that their coalition-mate might get the universe instead of them.
I wasn’t actually imagining a scenario where the humans had any power (such as the power to turn the AI off) - because I was responding to a thread where EY said “you’ve got 20 entities much smarter than you”.
Also even in that scenario (where humans have non trivial power), they are just another unaligned entity from the perspective of the AIs—and in my simple model—not even the slightest bit different. So they are just another possible player to form coalitions with and would thus end up in one of the coalitions.
The idea of a distinct ‘human threat’ and any natural coalition of AI vs humans, is something very specific that you only get by adding additional postulated speculative differences between the AIs and the humans—all of which are more complex and not part of my model.
(Really we should be talking about perfectly competitive games, and you could have a perfectly competitive game which has nonconstant total utilities, e.g. by taking a constant-sum game and then translating and scaling one of the utilities. But the above game is in fact not perfectly competitive; in particular if there’s a Pareto dominant outcome or a Pareto-worse outcome, assuming not all outcomes are the same, it’s not perfectly competitive.)
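To make the parenthetical concrete (a standard construction, not taken from the thread): start from a zero-sum game and apply a positive affine transformation to one player’s utility; the totals stop being constant, yet the players’ rankings of outcomes remain exactly opposed, so the game is still perfectly competitive.

```latex
u_2(\omega) = -u_1(\omega), \qquad
u_2'(\omega) := a\,u_2(\omega) + b \ \ (a > 0)
\quad\Longrightarrow\quad
u_1(\omega) + u_2'(\omega) = (1 - a)\,u_1(\omega) + b,
```

which varies across outcomes whenever a ≠ 1, even though u₂′ still ranks outcomes in exactly the reverse order of u₁.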
Sure, if they are that much better than us at “spread of possibilities on each other modeling probable internal thought processes of each other”, then we are probably in the scenario where humans don’t have much relevant power anyway and are thus irrelevant as coalition partners.
However, that ability to model others’ probable internal thought processes—especially if augmented with zk proof techniques—allows AGIs to determine which other AGIs have utility functions most aligned to their own. Even partial success at aligning some of the AGIs with humanity could then establish an attractor seeding an AGI coalition partially aligned to humanity.
However, that ability to model others’ probable internal thought processes—especially if augmented with zk proof techniques—allows AGIs to determine which other AGIs have utility functions most aligned to their own. Even partial success at aligning some of the AGIs with humanity could then establish an attractor seeding an AGI coalition partially aligned to humanity.
Not a strong ask, but I’ll say I’m interested in what you’re visualizing here if it all goes according to plan, because when I visualize what you say, I’m still imagining the 20 AGI systems immediately killing humanity and dividing up the universe; it’s just that now I might like a little bit of the universe they create. But it’s not “they stay in some equilibrium state where human civilization is in charge and using them as services”, which I believe is what Mr Drexler is proposing.
The outcome of course depends on the distribution of alignment, but there are now plausible designs that would not kill humanity. For example AGI with a human empowerment utility function would not kill humanity—and that is a statement we can be somewhat confident in because empowerment is crisply defined and death is minimally empowering (that type of AGI may want to change us in undesirable ways, but it would not want to kill us).
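For reference, the usual information-theoretic definition of empowerment (Klyubin, Polani & Nehaniv) that this kind of proposal leans on is the channel capacity from an agent’s next n actions to the resulting state. (The specific formula is an editor’s gloss, not something stated in the comment.)

```latex
\mathfrak{E}(s_t) \;=\; \max_{p(a_t, \dots, a_{t+n-1})} \; I\!\left(A_t, \dots, A_{t+n-1};\, S_{t+n} \,\middle|\, s_t\right)
```

Under a human-empowerment objective the AGI maximizes this quantity for the human’s action channel; a dead human has an action channel with zero capacity, which is the sense in which death is minimally empowering.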
There are various value learning approaches that may diverge and fail eventually, but they tend to diverge in the future, not immediately.
So I think it’s just unrealistic and hard to imagine we’ll get 20 different AGI systems none of which are at least partially aligned—especially initially. And if some are partially aligned in different ways, the resulting coalition can be somewhat more aligned than any individual AGI. For example, say AGI 3 wants to preserve humans but eliminate hedonic reward, and AGI 5 wants to preserve humans but increase our hedonic reward; a natural compromise is to preserve humans and leave hedonic reward unchanged.
There’s an ensemble robustness bonus in play from having multiple partially aligned systems—their specific alignment errors are unlikely to overlap.
Agents then tend to join aligned coalitions, so the natural outcome is a coalition of the semi-aligned AGI vs the rest (think Allies vs Axis, democratic allies vs autocratic states), with the semi-aligned coalition hopefully dominating, which then increases the alignment fraction. The end result is then hopefully humanity surviving with some variable amount of power, depending on the alignment/power distribution of the semi-aligned AGIs.
If the non-aligned AGI coalition wins, of course, we are more likely doomed; and since they are internally unaligned and held together only out of necessity, they recursively split into warring sub-coalitions until only one is left (just as Germany and Japan would ultimately have fought each other had they won WW2, as in The Man in the High Castle).
But no, I don’t put much weight in “they stay in some equilibrium state where human civilization is in charge and using them as services”. Even if everything favors AI services/tools over agents, eventually you get uploading and those evolve and occupy the niche of agentic AGI.
How do the agents “know” they are even in a situation where they need to deceive?
Each superintelligence would have been trained on a very large set of human-generated training examples, with SGD or other training mechanisms providing feedback that makes wrong answers less likely.
So if the agent fails to give the correct answer to a question, it has no context. It doesn’t know whether it’s in a sim where this will cause a weight update, or not. And even introspecting on whether it should deceive “this time” costs capacity that could go towards maximizing correct answers, so that neural circuitry will likely get removed by training mechanisms, as it isn’t reducing loss.
Drexler is proposing querying the agents “context free”. Think of it as spinning up a canned copy you have on your model zoo. The agent doesn’t know anything but the question in front of it. And these super-intelligences are the ‘winners’, they have maximized their scores on a very large number of test tasks.
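A minimal sketch of the “context free” querying protocol described here (an editor’s illustration, under the assumption that the frozen model can be treated as a pure function of its input); the toy model factory is hypothetical and only there to make the sketch runnable:

```python
from typing import Callable

def query_context_free(spawn_model: Callable[[], Callable[[str], str]],
                       question: str) -> str:
    """Spin up a fresh copy of a frozen model, ask one question, discard the copy.

    The instance sees nothing but the question: no conversation history,
    no signal about whether the call is a training example, an audit, or a
    deployment query, and no state persists between calls.
    """
    model = spawn_model()     # fresh copy from the "model zoo"
    answer = model(question)  # single stateless query
    del model                 # nothing carried over to the next question
    return answer

# Hypothetical stand-in for a frozen checkpoint, just to make this runnable.
spawn_toy_model = lambda: (lambda q: f"[canned answer to: {q}]")
print(query_context_free(spawn_toy_model, "What is 2 + 2?"))
```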
Conflict is expensive. If you have an alternative (i.e. performing a values handshake) which is cheaper, you’d probably take it? (Humans can’t do that, for reasons outlined in Decision theory does not imply that we get to have nice things.)
Of course humans can cooperate with AGI for a variety of reasons, just as we cooperate with humans. I don’t think decision theory philosophy explains humans well, and the evidence required to convince me that humans can’t cooperate with AGI would be enormous, so I don’t see the potential relevance of that post.