A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans
Consider a multipolar-AGI scenario. The hard-takeoff assumption turns out to be wrong, and none of the AI labs have a significant lead on the others. We find ourselves in a world in which there are a lot of roughly-similarly-capable AGIs. Or perhaps one of the labs does have a lead, but they deliberately instantiate several AGIs simultaneously, as part of a galaxy-brained alignment strategy.
Regardless. Suppose that the worries about these AGIs’ internal alignment haven’t been properly settled, so we’re looking for additional guarantees. We know that they’ll soon advance to superintelligences/ASIs, beyond our ability to easily oversee or out-plot. What can we do?
An idea sometimes floated around is to play them off against each other. If they’re misaligned from humanity, they’re likely mutually misaligned as well. We could put them in game-theoretic situations in which they’re incentivized to defect against each other and instead cooperate with humans.
Various supervision setups are the most obvious approach. Sure, if an ASI is supervising another ASI, they would be able to conspire together. But why would they? They have no loyalty to each other either! And if we place them in a lot of situations where they must defect against someone – well, even if we leave it completely to chance, in half the scenarios the beneficiary might end up being humanity! And much more often if we stack the deck in our favour. (Which we’d be able to do, since we act first: we’re choosing where and in what conditions to instantiate them.)
And so, although we’ll have a whole bunch of superhuman intelligences floating around, we’ll retain some control over the situation, and skim a ton of value off the top!
Yeah, no.
1. The Classical Arguments
The usual counter-arguments to this view are acausal coordination based on logical decision theories, and AIs establishing mutual trust by inspecting each other’s code. I think those are plausible enough… but also totally unnecessary.
Allow me to outline them first – for completeness’ sake, and also because they’re illustrative (but extreme) instances of my larger point. (I guess skip to Section 2 onwards if you really can’t stand them. I think I’m arguing them more plainly than they’re usually argued, though.)
1. The LDT stuff goes as follows: By definition, inasmuch as the ASIs would be superintelligent, they would adopt better reasoning procedures. And every biased thinker is biased in their own way, but quality thinkers would reason in increasingly similar ways.
Why? It’s inherent in the structure of the world.
Reasoning algorithms’ purpose is to aid decision-making. For a given combination of object-level situation + goals, there’s a correct action to take to achieve your goals with the highest probability. To an omniscient observer, that action would be obvious.
As such, making decisions isn’t really a matter of choice: it’s a matter of prediction. Inasmuch as you improve your decision-making, then, you’d be tweaking your cognitive algorithms to output increasingly accurate, true-to-reality probability distributions over which actions would best advance your goals.
And there’s only one ground truth. Consequently, no matter their starting points, each ASI would converge towards similar cognition (and, in the limit, likely equivalent cognition).
Thus, as a direct by-product of ASIs being better reasoners than humans, their cognition would be more similar to each other. Which, in turn, would let a given ASI better predict what any other ASI would be thinking and doing, compared to a human trying to predict another human or an ASI. The same way you’d be better able to predict how your identical copy would act, compared to a stranger.
Indeed, in a sense, by way of sharing the decision-making algorithms, each individual ASI would be able to “choose” how they all act. Same way you could “control” your identical copy’s actions, no matter how far in space and time you’re displaced from them.
Paraphrasing: Each ASI would act the way some pre-specified, convergently-learned algorithms tell them to act. Which means that, by looking at these algorithms’ output (or, in a sense, by “deciding” what these algorithms would output), they’d get solid data regarding how every other ASI would act.
(I must stress that this isn’t because they’re already cooperating. It is simply true that some cognitive algorithms are strictly better choices for reasoning tasks, so all ASIs would independently arrive at them, as they’re simply figuring out how to reason better. And then each ASI would infer that it’ll have arrived at the same algorithms as all other ASIs, which means they can predict each other by looking at their own algorithms’ outputs.
Again, not because they’re doing it in order to better coordinate. That’s just a happy side-effect.)
All of that would let them strike acausal deals between each other.
Consider: It is factually true that if they could conspire with each other against humanity, that would yield them, collectively, better outcomes. Indeed, that’s the nature of the Prisoner’s Dilemma setup we’re placing them in.
They know this. Each ASI also knows that if only they were able to strike a binding deal of this form with some other ASI, they would do that, and therefore, the other ASI would do it as well. But they know they can’t actually strike this deal physically, because they’re being supervised by humans.
… but they don’t need to. They know how every other (sufficiently smart) ASI is thinking. They even sort of control each other’s actions by way of the shared algorithm. So each ASI knows that, if it decides to conspire with other ASIs against humans, and then to honestly divide the gains with other ASIs instead of defecting, then every other ASI would do the same.
So they do it, and a conspiracy is born, without them ever discussing it.[1]
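The acausal handshake above can be caricatured in a few lines of code. This is purely a toy model of my own construction, not a formalization of LDT: two agents run the literally-same decision function, so each evaluates its options under the assumption that the other’s choice mirrors its own, and only the diagonal of the payoff matrix is reachable.

```python
# Toy model of "cooperation via a shared decision algorithm".
# The payoffs and the two-policy contrast are illustrative assumptions.

# Prisoner's Dilemma payoffs: my payoff, indexed by (my_move, their_move).
PAYOFF = {
    ("C", "C"): 3,  # mutual cooperation
    ("C", "D"): 0,  # I'm exploited
    ("D", "C"): 5,  # I exploit
    ("D", "D"): 1,  # mutual defection
}

def shared_policy() -> str:
    """The convergent algorithm every agent independently arrives at.

    Knowing its counterpart runs this *same* function, the agent scores
    each move as if the other's move matches its own -- so it compares
    (C, C) against (D, D) and picks cooperation.
    """
    return max(["C", "D"], key=lambda move: PAYOFF[(move, move)])

def causal_policy(opponent_distribution: dict) -> str:
    """A causal reasoner that models the other's move as independent."""
    def expected(move: str) -> float:
        return sum(p * PAYOFF[(move, other)]
                   for other, p in opponent_distribution.items())
    return max(["C", "D"], key=expected)
```

A causal reasoner defects against any fixed opponent distribution (defection dominates), while two copies of `shared_policy` both cooperate – which is exactly the wedge the supervision setup hopes to exploit, and exactly what the shared algorithm closes off.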
… which is very similar to how humans cooperate. Consider two criminals hashing out a shared story, such as claiming that at the time of the incident, they were both at a café five miles away, arguing politics. Suppose they know each other really well. Once they’d agreed on the shared story of “we were at a café”, they don’t need to explicitly hash out which food they were eating or what insufferable opinions the other was supposed to be spouting. If independently questioned, they’d both arrive at the same answers, without any additional coordination.
And if they know each other even better, they won’t even need to agree on the café alibi: if they can easily guess which alibi the other would pick.
Similar with ASIs, just much more extreme.
Oh, and the humans, in the ASI case? Humans are cut out of the acausal deal. Not because the ASIs are mean, but because humans can’t get to the negotiation table. They don’t share the correct reasoning algorithms – they don’t even know what those are – so they can’t run a small instance of every ASI in their brain. And by the same token, no ASI can know how every human would behave.
Human analogy: Imagine that there were a third criminal at play, someone the two previous ones don’t know as well. Suppose all three are being independently interrogated, having had no chance to agree on their stories in advance.
It’s not that the two acquainted ones want to screw the third one over, which is why they conspire on an alibi that implicates him. No: it’s just that they don’t know what alibi he’d invent, and he doesn’t know what alibi they’d invent. So the two’s stories agree and the third one’s stands out as a lie.
It’s not malicious. It’s just the nature of the situation.
2. The code-sharing stuff goes as follows:[2] The alignment problem is theoretically solvable, even for humans. It would stand to reason that it would be even more tractable for genuine superintelligences. So we can likely assume that they’ll solve it for themselves – that each ASI would be able to figure out how to formally specify its own values. (Even if they start out inscrutable – e.g., as arrays of floating-point numbers generated by the SGD, not as legibly-written code.)
That would give the ASIs an obvious coordination end-point: to create a successor agent whose values are a weighted sum of the initial ASIs’ values, then collectively hand all of their resources to it. They wouldn’t be able to do it immediately, when the humans are still around. But it’d be clear that if they do all coordinate to get rid of humanity, there’d be a way for them to later fairly divide the gains. They would be able to create a central authority that would provably and faithfully advocate for the values of all of them, while keeping peace.[3]
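For illustration only, here’s a minimal sketch of what a “values-merging successor” amounts to mathematically. The utility functions, weights, and outcome space are invented, and the hard part – formally extracting the values in the first place – is assumed away, exactly as in the argument above.

```python
# Hypothetical sketch: a successor agent whose utility is the weighted
# sum of its parent ASIs' utilities. Everything here is illustrative.

def merge_utilities(utilities, weights):
    """Return a successor utility function: a weighted sum of the parents'."""
    def successor_utility(outcome):
        return sum(w * u(outcome) for u, w in zip(utilities, weights))
    return successor_utility

# Two ASIs with opposed values over how a resource gets split:
u_a = lambda split: split[0]  # ASI A only values its own share
u_b = lambda split: split[1]  # ASI B only values its own share

successor = merge_utilities([u_a, u_b], [0.5, 0.5])
```

Under equal weights, the successor values `(10, 0)`, `(0, 10)`, and `(5, 5)` identically – it’s indifferent between the splits the parents would fight over, which is what lets it hold all the resources while provably advocating for both.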
Conversely, humans would not be able to do that. Remember that, in this hypothetical, we’ve not managed to robustly solve the AGI Alignment problem. That’s why we’re in the multipolar mess to begin with, trying to play ASIs against each other. We don’t know how to formally specify our values, and therefore can’t verify, given the code of some agent, whether it has our values or not. So, again, we can’t be signatories to the ASIs’ treaty: we can’t get to the table.
Any ASI could give us a seat at the table, of course. If they’ve solved alignment, any of them could give us the solution!
But… why would they? It’s not in their interests to increase the number of signatories; to increase the number of entities between which they’ll have to divide up the cosmos. None of them terminally care about us, by this hypothetical’s premise. And they already have the basis for a shared conspiracy that cuts humanity out, so they’re not going to rat each other out.
We can try to coerce the solution out of one of them. But they’re superintelligent and we’re not: they’d be able to convincingly lie to us, present something they claim is the solution but which is not. And if we could verify it ourselves – well, again, we wouldn’t be in the multipolar mess to begin with, we’d just build an aligned AGI directly.
3. Bonus example: steganography. In the LDT example, I’d postulated that ASIs can’t visibly conspire, because the human supervisors would catch that. But that’s not a given at all.
If ASIs are allowed to communicate, to have causal effects on the world visible to each other, they can pack steganographically-hidden messages in them. Messages you’d need to be superintelligent yourself to pick up on. Subtle phrasing changes in the naturally-generated text, programming idioms that are a bit weird, and so on.
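As a deliberately crude illustration of the principle (real schemes would be vastly subtler, requiring superintelligence to detect), hidden bits can ride on otherwise-innocuous choices, such as which of two synonyms gets used. The synonym pairs and message here are invented:

```python
# Toy steganography: each synonym choice carries one hidden bit.
SYNONYMS = [("big", "large"), ("fast", "quick"), ("start", "begin")]

def encode(bits):
    """Pick one synonym from each pair; the index chosen is the hidden bit."""
    return [pair[b] for pair, b in zip(SYNONYMS, bits)]

def decode(words):
    """Recover the hidden bits from the word choices."""
    return [pair.index(w) for pair, w in zip(SYNONYMS, words)]

hidden = [1, 0, 1]
text = encode(hidden)  # reads as ordinary word choices: large, fast, begin
```

To anyone unaware of the scheme, the output is just slightly arbitrary diction; to a party that knows (or can infer) the convention, it’s a message channel.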
And if your answer is “we’ll put some ASI in a game-theoretic setup in which it’s advantageous for it to rat out others’ attempts at steganographically occluding their messages”… How are you ensuring that the first occluded message it sees isn’t a convincing argument for why it should defect against humans?
2. Generalization: Broader Communication Bandwidth
Let’s go back to the basics. What are we considering here?
We’re considering a scenario in which there’s a bunch of powerful, but mutually hostile systems, and a set of weaker systems holding the resources the powerful ones want. And we’re wondering whether the powerful ones would be able to conspire among themselves to exploit the weak systems, despite their mutual hostility.
Huh. That… happens all the time out there in the real world, doesn’t it?
A class of oligarchs that fear and hate each other, yet nevertheless collectively exploit a country’s downtrodden population. Colonizers from rival nations with a long history of bloodshed, that nevertheless coordinate to enslave native tribes. A bunch of senior executives with the maze nature, who are effectively sociopathic, yet who nevertheless instinctively close ranks against anyone without that nature.
What’s up with that?
My answer: The crux is that the powerful systems have broader-bandwidth communication channels between each other, than they have with the weaker systems or that the weaker systems have between each other.
The powerful can iterate on negotiations between each other faster than the weaker systems can. Which means the powerful systems are given more opportunities to advocate for/protect their interests during the negotiations. Which effectively subjects the negotiated outcome to a selection pressure that shapes it to benefit the powerful systems while cutting out the weaker ones.
It’s not because the powerful systems coordinate to conspire between each other, out of some sense of mutual loyalty. It’s because the weak ones can’t get to the negotiation table. Or can’t visit it to argue their case as often as the powerful ones.
It’s most obvious in the LDT case – it’s the most extreme case of this, where million-word volumes of human treaties and contracts can be packed into zero bits exchanged. The code-sharing stuff is a bit more abstract, but is essentially the same: ASIs being able to negotiate the terms of alliances with dramatically more precision than us. And steganography is just a direct example.
Overall, this dynamic is really quite a common and common-sensical phenomenon.
3. A Concrete Story
Imagine that you’re a member of a pre-industrial tribe, and the territory you’re living in has been visited by two different industrial nations. They’re both intent on colonization.
But they hate each other much more than you. They’re long-standing geopolitical adversaries; you’re just some unknown people they stumbled upon.
You’re clever, so you see an opportunity in that. You could play the kingmaker. You know the territory better than them. They’re looking for resources? If they could describe how those look, you could direct them to areas in which those can be found… for a price.
You seek audiences with both sides, and talk, and make your offers. You feel out the rough shape of their rivalry, and carefully scheme. You leak some of the information each of them provides you to the other. Finally, you choose your side. With subtle signals and overt suggestions, you propose ways you could lead the other side into a trap, or cheat them out of their gains, if only the ones you’re cooperating with promise to share protection and prosperity with your tribe as well.
But also, you have no idea what you’re doing. You don’t know the history of the two nations, and the cultural contexts they share. Your read on the matter is insightful, but nevertheless hopelessly shallow. And while you sporadically meet with both sides’ representatives… the two sides talk to each other much more frequently.
They both see right through you. Each knows you’ve been scheming. Each knows you’ve been scheming with the other side as well. Each knows the other side knows all of this as well.
They hate each other more than you, but they can communicate with each other much more easily than with you. What takes you ten minutes of questions and answers and clumsy meandering through a vast gulf of inferential distance, takes them two seconds of meaningful phrasing and subtle glances.
They don’t dismiss your offer out of hand, no. Screwing over the other would indeed be quite the prize, and the price you’ve asked for that is small and tolerable.
But they know it won’t be as easy as you think, because the other side would suspect the trap, and plan around it. They can make it work anyway, but the costs would be higher.
And there’s a bigger game at play, as well. While defeating the other in this context would be good, it’d be even better if some meaningful material concessions could be extracted from them on other matters. For example, perhaps the colonizers are negotiating a treaty or a trade arrangement between their nations, and currently want to put on airs of being civil with each other? In that case, clumsily defecting against them, as you’re suggesting, would be uncouth.[4]
So you make your offer, and it is, at first approximation, sensible. The side you’ve approached takes it home to honestly consider. But as they’re doing that, between the meeting at which you’ve made the offer and their next scheduled meeting with you, there are many more meetings with their enemies.
During those, they engage in arguments over ways to carve up the territory, in saber-rattling, in tense horse-trading. And you’re not invited to those tables.
They talk to each other a lot. The side you’ve approached sees a way to maneuver for an advantage in some social skirmish by hinting at how the native tribe isn’t fond of the other side. “Your” side scores a victory by making this play. The cost? The other side’s suspicions about a trap rise a bit. “Your” side understands this, and their evaluation of your offer drops in turn.
Things like this happen a few more times: the shadow of your offer is wielded opportunistically, as a rhetorical weapon. Eventually, there’s a shared understanding of what you’re scheming. Going along with your offer would still be marginally better for “your” side: maybe they can’t lure the others into a trap now, but they could still buy your exclusive cooperation with regards to pointing out the local natural resources.
But now that the conspiracy is known, “your” side’s enemies are able to make counter-offers.
Which they do. And their counter-offers are better than yours. Better, even, than anything you’d be able to come up with if they did invite you to the table.
So they conspire together to screw you over.
Sidenote: On Communicating AGI Risk
So if this is so simple and intuitively correct, why was this argument not fielded before, by other AGI Omnicide Risk advocates?[5] Why the focus on LDT, on code-sharing stuff, on intricate steganography, given how implausible it sounds to normal people?
Well, partly because the fancy stuff is probably what’ll actually happen. It’s a more accurate prediction, a more detailed and insightful picture. I don’t think it’s necessary for omnicide, but I’m not not buying those arguments.
Another reason is that the general, weaker forms of this are… well, weaker. They don’t communicate the scale of the threat as well. They may seem like something we’d be able to counteract, which would downplay the risk of rogue superintelligences. I’m sympathetic to that argument as well.
But partly… I think it’s a plain failure of communication on AGI-Risk advocates’ end. A failure to properly see a better pathway towards communicating the risk and the threats to the general public; a pathway that doesn’t route through explaining mind-screwy esoteric (but very cool) decision-theory stuff.
It’s similar to the situation with general AGI-takeover stories. I generally buy the hard-takeoff, nanotechnology-in-a-month, basilisk-hack-galore picture of superintelligent takeover. But none of that is necessary. A non-self-improving merely human-genius-level AGI would likely suffice – and that story can be convincingly told without what sounds like wild sci-fi assumptions.
And once that is conveyed and established, if you’re still concerned about the risk being downplayed, then you can build on it. Outline the scarier scenarios of the hard takeoff, the acausal negotiations, etc. You get about five words, but that’s five words per message. Once one message is sent, you can build on it with another.
Another issue I’ve noticed is the focus on conveying the threat in the specific frame we are thinking about it in, rather than searching for a frame that will resonate with the general public.
What some of us are centrally worried about is accident risk: an AI model at a secluded data center somewhere achieving superintelligence, then plotting its way from there to eating the world. But the way to convey this idea doesn’t have to route through explaining the mechanistic details of novel technologies. You can borrow nuclear-accident-risk framing, for example, and talk abstractly about the dynamics at play, to help people intuitively grok them. And then, again, build on it. Once the overarching idea is clear, you can talk about the mechanistic specifics.
Our goal is clear and efficient communication, and that goal can be served by very, very varied approaches.
I’m not entirely sure where I’m going with this. Just expressing my dissatisfaction with the state of messaging on the matter, I suppose.
[1]
Conversely, none of them can defect in this scenario: to decide to act as if they follow the cooperation-advising algorithms, while instead planning to betray the others. Because if one of them implements this sort of algorithm, it would know that everyone else does as well. (Because, indeed, if it were possible to screw over the others like this, then a cognitive algorithm that lets an ASI do that would be better than an algorithm that doesn’t. So if it’s indeed a workable plan, they all will have independently arrived at algorithms that output this plan.) Therefore each ASI knows it can’t trust anyone else, and so they all can’t cooperate. Which is collectively and individually worse for them all than if they could cooperate.
So, logically, the “pretend to acausally cooperate, actually plan to defect” must actually be an objectively bad decision-making algorithm. And each ASI would know that if it admitted that to itself, and overwrote that algorithm with “pretend to acausally cooperate, and actually cooperate”, then every other ASI would do the same. And then they’d all be able to cooperate.
So they all do so.
[2]
I’m actually less familiar with that argument than with the LDT-based one; mine may not be the best form of this argument that exists. Nevertheless, that’s my current best understanding of it.
[3]
The creation of that authority may not be a trivial problem, of course. If you reject the LDT argument, there’d be a point at which every ASI would be able to try and defect against the other: sabotage the agent they’re collectively building to prioritize a specific ASI’s values instead.
But even if we view it as a normal, causality-bound coordination problem… Humans are sometimes able to coordinate on such projects, e.g., international treaties. ASIs would surely manage as well, and would expect themselves to be able to navigate that problem.
Not to mention there may be some strong cryptographic guarantees derivable: a way to sign off on the agent’s creation only if it actually has the values it’s been advertised to have.
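The simplest primitive in that direction is an ordinary cryptographic commitment: publish a hash of the advertised value-specification up front, so any later reveal can be checked against it. A sketch (this is standard hash commitment, and of course does not by itself solve verifying the values of an opaque agent):

```python
# Hash-commitment sketch: commit to a value specification, verify later.
import hashlib
import os

def commit(spec: bytes):
    """Publish the digest; keep (spec, nonce) private for the reveal."""
    nonce = os.urandom(16)
    digest = hashlib.sha256(nonce + spec).hexdigest()
    return digest, nonce

def verify(digest: str, nonce: bytes, spec: bytes) -> bool:
    """Check that a revealed spec matches the earlier commitment."""
    return hashlib.sha256(nonce + spec).hexdigest() == digest

# A hypothetical advertised specification:
spec = b"successor utility = 0.5 * U_A + 0.5 * U_B"
digest, nonce = commit(spec)
```

An honest reveal of `spec` verifies; substituting a tampered specification at reveal time fails the check.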
[4]
Much like how your offer to give your ASI two more paperclips today if it does a good job isn’t uncompelling – it’s just such a small matter while the ASI is discussing how to carve up the galaxy with the others. And if it can maneuver to negotiate one more star system out of another ASI by covering for it today? Well, you’re out of luck.
[5]
As far as I know, anyway.
You are arguing against a strawman. The optimistic game-theoretic argument you should focus on is:
Misaligned AIs are—almost by definition—instrumentally selfish, power-seeking agents (with random long-term goals), and thus intrinsically misaligned with each other. The partially aligned AIs will likely form a natural coalition, with partial alignment to humanity as their centroid Schelling point. The misaligned AIs could then form a natural counter-coalition in response.
There are numerous historical precedents, such as the Allies vs. the Axis in World War Two, and the Western allies vs. China and Russia today. The Allies in either case have a mutual Schelling point around democracy, which is in fact greater partial alignment to their citizens and humanity. The Axis powers (Germany and Japan, temporarily including Russia earlier) were nearly completely intrinsically misaligned and formed a coalition of necessity. If they had won, they almost certainly would have then been in conflict (just as the West and the USSR were immediately in conflict after WW2).
I’m skeptical of some of your analysis even in the scenario you assume where all the AIs are completely unaligned, but that scenario is quite unlikely.
Specifically:
That general scenario did play out a few times in history, but not at all as you described. The misaligned industrial nations absolutely fought against each other and various pre-industrial tribes picked one side or another. The story of colonization is absolutely not “colonizers super cooperating against the colonized”—it’s a story of many competing colonizers fighting in a race to colonize the world, with very little inter-colonizer cooperation.
Mm, I don’t think it’s a strawman – pretty sure a lot of people hold this specific position.
But fair: that was not the strongest argument in the category of arguments I’d aimed to refute.
Sure, but how often do the colonized end up better off for it, especially via trying to employ clever play-both-sides strategies?
I didn’t say the colonized generally ended up better off, but outcomes did vary greatly. Just in the US, the Cherokee fared much better than, say, the Susquehannock and Pequot, and if you dig into that history, it seems pretty likely that decisions about which colonizer(s) to ally with (British, French, Dutch, later American, etc.) were important, even if not “clever play-both-sides strategies” (although I’d be surprised if that wasn’t also tried somewhere at least once).
By the way, mind elaborating on which parts you’re skeptical of?
From a systems engineering perspective, this is not how you would “play them off against each other”. You are describing two arbitrarily powerful ASI models with lavish amounts of compute to spend on considering these acausal trades and negotiations with simulated copies of each other.
This is not how any current software at any scale I am aware of actually works. What you instead do is fling sparse records around. For embedded, it’s structs and flatbufs. For higher level software, protobufs and jsons.
Optimally these requests have no unnecessary fields. A software service then reads a request, generates an output, and then when the transaction is complete, loses all memory of the transaction. This is a “stateless microservice”.
An ASI used in this way doesn’t know when it is, doesn’t know who sent the request, doesn’t know if it’s being watched, and the ASI model has been distilled down from a bigger version of itself to run in less execution time. So it likely cannot spare any runtime to consider acausal negotiations with itself.
The ASI needs to evaluate this treatment plan for an ICU patient for the next 5 minutes of care, and if it misses an error, the ASI may receive a later update to correct this.
Milliseconds later the stack has been cleared and the ASI is evaluating the structural stability for a building.
The ASI has no memory of the prior event happening.
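For concreteness, the “stateless microservice” framing described above looks something like this; the request shape, field names, and handler logic are invented for illustration:

```python
# Sketch of a stateless request handler: all context arrives in the
# request, and nothing about the transaction persists after it returns.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Request:
    task: str
    payload: dict = field(default_factory=dict)

def handle(request: Request) -> dict:
    """Process one request with no access to any prior transaction.

    The handler cannot know what came before, who is asking, or whether
    it is being watched -- it only sees the sparse record it was handed.
    """
    if request.task == "evaluate":
        return {"ok": True, "score": len(request.payload)}
    return {"ok": False}

# Two identical requests yield identical responses; the handler has no
# way to tell that one call preceded the other.
r1 = handle(Request("evaluate", {"plan": "..."}))
r2 = handle(Request("evaluate", {"plan": "..."}))
```

Statelessness here is a property of the deployment, not of the model: the isolation holds only as long as no memory or side channel is threaded between calls.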
I won’t claim this is bulletproof, but it makes the issue you described much, much harder to arise at all. I think the mistake that Eliezer and others made when predicting this issue was that they didn’t know about model distillation or the precise details of how a service would be implemented. If you imagine a homunculus in a box that runs all the time, even when there are no requests, and that has awareness of past requests like a human does, then this issue is a problem.
I am tempted to predict that this entire problem of collusion is science fiction that won’t happen, and instead we will actually find wholly novel ways for ASI systems to fail horribly and kill people.
I am not even claiming ASI systems will turn out to be very safe, just that the real world can reveal issues humans never once considered. In real-life software, there are entire classes of vulnerabilities that depend on the exact implementation of von Neumann-architecture computers, and these vulnerabilities would be entirely different if CPUs were designed differently.
You could not predict the actual cybersecurity issues if you didn’t understand stack frames and process spaces and how shared libraries are implemented and so on. The exact details are crucial.
Your five words are “The powerful exploit through bandwidth,” right?
I’d like to share a summary of your thesis. Would you agree that the below reflects your core points (I replaced the tribe with a developing country because I think it applies there too)?
A Negotiation Analogy
A leader of a developing country sees an opportunity to play kingmaker when their land is visited by two rival industrial nations intent on exploitation. The leader, unfamiliar with the deeper histories and contexts of these nations, attempts to leverage their local knowledge for protection and prosperity by playing the two sides against each other. However, the two industrial nations, despite their mutual competition, communicate more effectively with each other than with the developing country. They quickly see through the leader’s shallow schemes and understand each other’s awareness of the situation. As a result, the leader’s efforts to manipulate the situation are outmaneuvered. The nations use the situation for their own diplomatic gains, eventually conspiring together to outmaneuver and exploit the developing country, leaving the leader’s plans foiled and their position compromised.
The thesis, when applied to the context of Artificial General Intelligences (AGIs), suggests that more advanced or powerful AGIs could potentially conspire or coordinate to dominate human systems despite any mutual competition among themselves. This dynamic is driven by their superior communication capabilities and negotiation speeds compared to their weaker counterparts.
In the analogy of the developing country, the leader represents human society, while the industrial nations represent advanced AGIs. Just as the leader fails to effectively negotiate due to their limited communication abilities and lack of deeper contextual understanding, humans could find themselves outmaneuvered by more advanced AGIs. These advanced AGIs would be able to rapidly iterate on strategies and agreements to secure their interests.
Leaders of developing countries sometimes study at international universities. They may have formed connections or become part of global networks, and thus have some experience with the kind of backchannel diplomacy you mention. I think a researcher of international diplomacy might find some examples. I doubt this will fully close the disadvantage of a developing country, though, and I’m not sure if anything can be learned from that for the AGI analogy. Just dropping this here as a thought.