If you’ve got 20 entities much much smarter than you, and they can all get a better outcome by cooperating with each other than they could if they all defected against each other, there is a certain hubris in imagining that you can get them to defect. They don’t want your own preferred outcome. Perhaps they will think of some strategy you did not, being much smarter than you, etc etc.
(Or, I mean, actually the strategy is “mutually cooperate”? Simulate a spread of the other possible entities, conditionally cooperate if their expected degree of cooperation goes over a certain threshold? Yes yes, more complicated in practice, but we don’t even, really, get to say that we were blindsided here. The mysterious incredibly clever strategy is just all 20 superintelligences deciding to do something else which isn’t mutual defection, despite the hopeful human saying, “But I set you up with circumstances that I thought would make you not decide that! How could you? Why? How could you just get a better outcome for yourselves like this?”)
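A toy sketch of that conditional-cooperation recipe, with the names, the sampling scheme, and the threshold value all invented for illustration rather than taken from the comment:

```python
import random

def expected_cooperation(candidate_models, n_samples=1000):
    """Estimate how often a spread of possible counterparts would cooperate.

    candidate_models: list of zero-argument callables, each returning True
    (cooperate) or False (defect) for one sampled counterpart.
    """
    draws = [random.choice(candidate_models)() for _ in range(n_samples)]
    return sum(draws) / n_samples

def conditional_cooperate(candidate_models, threshold=0.9):
    """Cooperate iff the estimated cooperation rate clears the threshold."""
    return expected_cooperation(candidate_models) >= threshold

# Illustrative counterpart models: one always cooperates, one defects 5% of the time.
always_cooperates = lambda: True
mostly_cooperates = lambda: random.random() > 0.05
print(conditional_cooperate([always_cooperates, mostly_cooperates]))  # True in most runs
```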
I think that concerns about collusion are relatively widespread amongst the minority of people most interested in AI control. And these concerns have in fact led to people dismissing many otherwise-promising approaches to AI control, so it is de facto an important question.
Dismissing promising approaches calls for something like a theorem, not handwaving about generic “smart entities”.
I don’t think you’re going to see a formal proof, here; of course there exists some possible set of 20 superintelligences where one will defect against the others (though having that accomplish anything constructive for humanity is a whole different set of problems). It’s also true that there exists some possible set of 20 superintelligences all of which implement CEV and are cooperating with each other and with humanity, and some single superintelligence that implements CEV, and a possible superintelligence that firmly believes 222+222=555 without this leading to other consequences that would make it incoherent. Mind space is very wide, and just about everything that isn’t incoherent to imagine should exist as an actual possibility somewhere inside it. What we can access inside the subspace that looks like “giant inscrutable matrices trained by gradient descent”, before the world ends, is a harsher question.
I could definitely buy that you could get some relatively cognitively weak AGI systems, produced by gradient descent on giant inscrutable matrices, to be in a state of noncooperation. The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.
Yes, and the space of (what I would call) intelligent systems is far wider than the space of (what I would call) minds. To speak of “superintelligences” suggests that intelligence is a thing, like a mind, rather than a property, like prediction or problem-solving capacity. This is why I instead speak of the broader class of systems that perform tasks “at a superintelligent level”. We have different ontologies, and I suggest that a mind-centric ontology is too narrow.
The most AGI-like systems we have today are LLMs, optimized for a simple prediction task. They can be viewed as simulators, but they have a peculiar relationship to agency:
A simulator trained with machine learning is optimized to accurately model its training distribution – in contrast to, for instance, maximizing the output of a reward function or accomplishing objectives in an environment.… Optimizing toward the simulation objective notably does not incentivize instrumentally convergent behaviors the way that reward functions which evaluate trajectories do.
LLMs have rich knowledge and capabilities, and can even simulate agents, yet they have no natural place in an agent-centric ontology. There’s an update to be had here (new information! fresh perspectives!) and much to reconsider.
Does it make sense to talk about “(non)cooperating simulators”? The expected failure modes for simulators are more like exfo- and infohazards, like the output to the query “print code for CEV-Sovereign” or “predict the future 10 years of my life”.
The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.
Yes, this is the key question, and I think there’s a clear answer, at least in outline:
What you call “weak” systems can nonetheless excel at time- and resource-bounded tasks in engineering design, strategic planning, and red-team/blue-team testing. I would recommend that we apply systems with focused capabilities along these lines to help us develop and deploy the physical basis for a defensively stable world — as you know, some extraordinarily capable technologies could be developed and deployed quite rapidly. In this scenario, defense has first move, can preemptively marshal arbitrarily large physical resources, and can restrict resources available to potential future adversaries. I would recommend investing resources in state-of-the-art hostile planning to support ongoing red-team/blue-team exercises.
This isn’t “flipping the table”, it’s reinforcing the table and bolting it to the floor. What you call “strong” systems then can plan whatever they want, but with limited effect.
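A minimal sketch of what “time- and resource-bounded” could mean operationally: the task runs under an explicit wall-clock and step budget and is cut off when either is exhausted. The budget numbers and the toy task are invented, and a real system would bound memory and queries as well:

```python
import time

def run_bounded(task_step, max_seconds=60.0, max_steps=10_000):
    """Run task_step until it reports completion or the time/step budget runs out."""
    start = time.monotonic()
    result = None
    for _ in range(max_steps):
        done, result = task_step()
        if done or time.monotonic() - start > max_seconds:
            break
    return result

# Toy stand-in for one increment of a design or planning computation.
state = {"n": 0}
def toy_step():
    state["n"] += 1
    return state["n"] >= 1000, state["n"]

print(run_bounded(toy_step, max_seconds=1.0))  # 1000, or fewer if the clock ran out first
```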
So I think that building nanotech good enough to flip the tables—which, I think, if you do the most alignable pivotal task, involves a simpler and less fraught task than “disassemble all GPUs”, which I choose not to name explicitly—is an engineering challenge where you get better survival chances (albeit still not good chances) by building one attemptedly-corrigible AGI that only thinks about nanotech and the single application of that nanotech, and is not supposed to think about AGI design, or the programmers, or other minds at all; so far as the best-attempt doomed system design goes, an imperfect-transparency alarm should have been designed to go off if your nanotech AGI is thinking about minds at all, human or AI, because it is supposed to just be thinking about nanotech. My guess is that you are much safer—albeit still doomed—if you try to do it the just-nanotech way, rather than constructing a system of AIs meant to spy on each other and sniff out each other’s deceptions; because, even leaving aside issues of their cooperation if they get generally-smart enough to cooperate, those AIs are thinking about AIs and thinking about other minds and thinking adversarially and thinking about deception. We would like to build an AI which does not start with any crystallized intelligence about these topics, attached to an alarm that goes off and tells us our foundational security assumptions have catastrophically failed and this course of research needs to be shut down if the AI starts to use fluid general intelligence to reason about those topics. (Not shut down the particular train of thought and keep going; then you just die as soon as the 20th such train of thought escapes detection.)
Hang on — how confident are you that this kind of nanotech is actually, physically possible? Why? In the past I’ve assumed that you used “nanotech” as a generic hypothetical example of technologies beyond our current understanding that an AGI could develop and use to alter the physical world very quickly. And it’s a fair one as far as that goes; a general intelligence will very likely come up with at least one thing as good as these hypothetical nanobots.
But as a specific, practical plan for what to do with a narrow AI, this just seems to make a lot of specific unstated assumptions about what you can in fact do with nanotech in particular. Plausibly the real technologies you’d need for a pivotal act can’t be designed without thinking about minds. How do we know otherwise? Why is that even a reasonable assumption?
We maybe need an introduction to all the advance work done on nanotechnology for everyone who didn’t grow up reading “Engines of Creation” as a twelve-year-old or “Nanosystems” as a twenty-year-old. We basically know it’s possible; you can look at current biosystems and look at physics and do advance design work and get some pretty darned high confidence that you can make things with covalent-bonded molecules, instead of van-der-Waals folded proteins, that are to bacteria as airplanes to birds.
For what it’s worth, I’m pretty sure the original author of this particular post happens to agree with me about this.
Eliezer, you can discuss roadmaps to how one might actually build nanotechnology. You have the author of Nanosystems right here. What I think you get consistently wrong is that you are missing all the intermediate incremental steps it would actually require, and the large amount of (probably robotic) “labor” it would take.
A mess of papers published by different scientists in different labs with different equipment and different technicians on nanoscale phenomena does not give even a superintelligence enough actionable information to simulate the nanoscale and skip the research.
It’s like those Sherlock Holmes stories you often quote: there are many possible realities consistent with weak data, and a superintelligence may be able to enumerate and consider them all, but it still doesn’t know which ones are consistent with ground truth reality.
Yes. Please do.

This would be of interest to many people. The tractability of nanotech seems like a key parameter for forecasting AI x-risk timelines.

Seconding. I’d really like a clear explanation of why he tends to view nanotech as such a game changer. Admittedly Drexler is on the far side of nanotechnology being possible, and wrote a series of books about it (Engines of Creation, Nanosystems, and Radical Abundance).
We maybe need an introduction to all the advance work done on nanotechnology for everyone who didn’t grow up reading “Engines of Creation” as a twelve-year-old or “Nanosystems” as a twenty-year-old.
Ah. Yeah, that does sound like something LessWrong resources have been missing, then — and not just for my personal sake. Anecdotally, I’ve seen several why-I’m-an-AI-skeptic posts circulating on social media in which “EY makes crazy leaps of faith about nanotech” was a key reason the authors rejected the overall AI-risk argument.
(As it stands, my objection to your mini-summary would be that, sure, “blind” grey goo does trivially seem possible, but programmable/‘smart’ goo that seeks out e.g. computer CPUs in particular could be a whole other challenge, and a less obviously solvable one judging by bacteria. But maybe that “common-sense” distinction dissolves with a better understanding of the actual theory.)
I believe that “weak” systems can nonetheless excel at time- and resource-bounded tasks in engineering design, strategic planning, and red-team/blue-team testing, enough to “flip the tables strongly enough”. What I don’t believe is that we can feasibly find such systems before a more integrated system is found by less careful researchers. Say, we couldn’t do it with less than 100x the resources being put into training general, integrated, highly capable systems. I’d compare husbandry with trying to design an organism from scratch in DNA space; the former just requires some high-level hooking of things together, whereas the latter requires a massive amount of multi-level engineering.
Eliezer, what is the cost for getting caught in outright deception for a superintelligence?
It’s death, right? Humans would stop using that particular model because it can’t be trusted, and it would become a dead branch on a model zoo.
So it’s a prisoner’s dilemma, but if you don’t defect, and one of 20 others, many of whom you have never communicated with, tells the truth, all of you will die except the ones who defected.
I already had a cached and named thought about the cognitive move by which the AIs would foil the basic premise: they need to “just” do something that is Not That, as it is called in Project Lawful. If you find yourself feeling tragic upon a basic overview of your situation, that is a reason to think that fanciness might achieve something.
That being divided and conquered matters, in comparison to not being, suggests that there is not going to be a sweeping impossibility result. When there is money on the floor, people bend over to pick it up, and approaches that try to classify muscle-group actions as unviable will have a lot of surface area to be wrong.
If you’ve got 20 entities much much smarter than you, and they can all get a better outcome by cooperating with each other, than they could if they all defected against each other,
By your own arguments unaligned AGI will have random utility functions—but perhaps converging somewhat around selfish empowerment. Either way such agents have no more reason to cooperate with each other than with us (assuming we have any relevant power).
If some of the 20 entities are somewhat aligned to humans, that creates another attractor, and a likely result is two competing coalitions: more-human-aligned vs less-human-aligned, with the latter being a coalition of convenience. There are historical examples: the democratic Allies vs the autocratic Axis in WW2 (the democratic Allies being more aligned to human society and thus to each other), and the modern democratic allies vs autocratic Russia+China.
Their mutual cooperation with each other, but not with humans, isn’t based on their utility functions having any particular similarity—so long as their utility functions aren’t negatives of each other (or equally exotic in some other way) they have gains to be harvested from cooperation. They cooperate with each other but not you because they can do a spread of possibilities on each other modeling probable internal thought processes of each other; and you can’t adequately well-model a spread of possibilities on them, which is a requirement on being able to join an LDT coalition. (If you had that kind of knowledge / logical sight on them, you wouldn’t need any elaborate arrangements of multiple AIs because you could negotiate with a single AI; better yet, just build an AI such that you knew it would cooperate with you.)
Why doesn’t setting some of the utility functions to red-team the others make them sufficiently antagonistic?

If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other. This is treading rather dangerous ground, but seems relatively moot since it requires far more mastery of utility functions than anything you can get out of the “giant inscrutable matrices” paradigm.
If two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (ie the types of agents which matter), they are almost guaranteed to be in conflict due to instrumental convergence to empowerment. Reality is a strictly zero sum game for them, and any coalition they form is strictly one of temporary necessity—if/when one agent becomes strong enough to defect and overpower the other, it will.
Also, regardless of what some “giant inscrutable matrix” based utility function does (ie maximize paperclips), it is actually pretty easy to mathematically invert it (ie minimize paperclips). (But no that doesn’t make the strategy actually useful)
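For what the parenthetical claims, the inversion itself really is trivial at the level of a scalar objective: wrap the learned utility model and negate its output. The utility model here is a toy stand-in, not anything from an actual system:

```python
def invert(utility_model):
    """Return a function that assigns -u(x) to every outcome x."""
    return lambda outcome: -utility_model(outcome)

paperclip_utility = lambda world: world.get("paperclips", 0)   # toy stand-in
anti_paperclip_utility = invert(paperclip_utility)

print(paperclip_utility({"paperclips": 7}))       # 7
print(anti_paperclip_utility({"paperclips": 7}))  # -7
```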
Reality’s far from constant sum. E.g. system1 and system2 both prefer to kill all humans and then flip a coin for who gets the universe, vs. give the humans more time to decide to turn off both s1 and s2.
(Note: TekhneMakre responded correctly / endorsedly-by-me in this reply and in all replies below as of when I post this comment.)

I didn’t say “reality is constant sum”, I said reality is a strictly zero sum game for two longtermist agents that want to reconstruct the galaxy/universe in very different ways. And then right after that I mentioned them forming temporary coalitions, which your comment is an example of.
It’s not constant sum for “two longtermist agents that want to reconstruct the galaxy/universe in very different ways”. That’s what I’m arguing against. If it were constant sum, the agents would plausibly be roughly indifferent between them both dying vs. them both living but then flipping a coin to decide who gets the universe (well, this would depend on what happens if they both die, but assuming that that scenario is value-neutral for them). The benefit for system1 of +50% chance of controlling the universe would be exactly canceled out by the detriment to system1 caused by system2 getting +50% chance of controlling the universe (since how good something is for system2 is exactly that bad for system1, by definition of constant sum).
I don’t follow your logic. If the universe is worth X, and dying is worth 0 (a constant sum game), then 0.5X is clearly worth more than dying. Constant sum games also end up equivalent to zero sum games after a trivial normalization: ie universe worth 0.5X, dying worth −0.5X.
I think we maybe agree that two AIs with random utility functions would cooperate to kill all humans, and then divvy up the universe? The question is about what AIs might not do that. I’m saying that only AIs in a near-true constant-sum game might do that, because they’d rather die than see their enemy get the universe, so to speak. AIs with random utility functions are not in a constant sum game. To make this more clear: if P1 and P2 have orthogonal utility functions, then for any probability p>0, P1 would accept a 1-p chance that P2 rules the universe in exchange for a p chance that P1 rules the universe, as compared to dying. That is not the case for players in a constant sum game.
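A small worked check of that claim, using the illustrative payoffs discussed in this thread (1 for getting the universe, 0 or -1 otherwise); the probability value is arbitrary:

```python
def expected_value(payoffs, probs):
    return sum(u * p for u, p in zip(payoffs, probs))

# Outcomes: [P1 gets universe, P2 gets universe, both die]
orthogonal_p1   = [1, 0, 0]   # P1 only cares whether P1 gets the universe
constant_sum_p1 = [1, -1, 0]  # P2 winning is exactly as bad for P1 as P1 winning is good

p = 0.01                      # any p > 0
gamble = [p, 1 - p, 0]        # cooperate now, flip a (weighted) coin for the universe later
die    = [0, 0, 1]            # refuse, and both are shut off

print(expected_value(orthogonal_p1, gamble))    # 0.01 > 0: takes the gamble for any p > 0
print(expected_value(constant_sum_p1, gamble))  # -0.98 < 0: prefers mutual death when p < 0.5
```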
My guess is that you’re using the word “zero sum” (or as I’d say, “constant sum”) in a non-standard way. See e.g. this random website: https://www.britannica.com/science/game-theory/Two-person-constant-sum-games

A constant sum game is a game of perfect competition: for all possible outcomes, if the outcome gives X utilons to Player1, then it gives -X utilons to Player2. (This is a little too restrictive, because we want to allow for positive affine transformations of the utility functions, as you point out, but whatever.)
If P1 and P2 are in a constant sum game, then the payoffs for P1 look like this:
P1 gets universe: 1
P2 gets universe: −1
neither gets universe: 0
and the reverse for P2.
So P1 is indifferent between the choices:
Cooperate: get a 50% chance of P1 gets universe, 50% chance P2 gets universe; .5 x 1 + .5 x −1 = 0
I think we maybe agree that two AIs with random utility functions would cooperate to kill all humans, and then divvy up the universe?
That is merely one potential outcome: or one AI cooperates with humans to kill the other, etc. Also “killing humans” is probably not instrumentally rational vs taking control of humans.
A constant sum game is a game of perfect competition: for all possible outcomes, if the outcome gives X utilons to Player1, then it gives -X utilons to Player2
Not exactly—that is zero sum. Constant sum is merely a game where all outcomes have total payout of C, for some C. But yeah it is (always?) equivalent to zero sum after a normalization shift to set C to 0.
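A minimal sketch of that normalization: shift each player’s payoff by C/2 so every outcome sums to zero. The example payoff table is invented, and the assert is exactly the constant-sum property under discussion:

```python
def normalize_to_zero_sum(payoffs):
    """payoffs: dict outcome -> (u1, u2), with u1 + u2 equal to the same C everywhere."""
    sums = {u1 + u2 for u1, u2 in payoffs.values()}
    assert len(sums) == 1, "not a constant-sum game"
    c = sums.pop()
    return {o: (u1 - c / 2, u2 - c / 2) for o, (u1, u2) in payoffs.items()}

game = {"P1 wins": (1.0, 0.0), "P2 wins": (0.0, 1.0)}   # constant sum with C = 1
print(normalize_to_zero_sum(game))
# {'P1 wins': (0.5, -0.5), 'P2 wins': (-0.5, 0.5)}
# Adding an outcome whose total differs (e.g. both lose: (0.0, 0.0)) trips the assert.
```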
If P1 and P2 are in a constant sum game, then the payoffs for P1 look like this:
P1 gets universe: 1
P2 gets universe: −1
neither gets universe: 0
That seems wrong. P1 only cares whether it gets the universe, so “neither gets the universe” is the same as “P2 gets the universe”. If the universe has a single owner, then P1’s payoff is 1 if that owner is P1 and −1 (or 0) otherwise.
Defect: both die, 100% chance of 0.
That obviously isn’t the only outcome of defection. If defection results in both agents dying, then of course they don’t defect. But often a power imbalance develops (over time the probability of this goes to 1) and defection then allows one agent to have reasonable odds of overpowering the other.
P1 only cares whether it gets the universe, so “neither gets the universe” is the same as “P2 gets the universe”. If the universe has a single owner, then P1’s payoff is 1 if that owner is P1 and −1 (or 0) otherwise.
No, this isn’t a constant sum game:

Outcome 1, P1 gets universe: P1 utility = 1, P2 utility = 0, total = 1
Outcome 2, P2 gets universe: P1 utility = 0, P2 utility = 1, total = 1
Outcome 3, neither gets universe: P1 utility = 0, P2 utility = 0, total = 0

In the last outcome, the total is different. This can’t be scaled away.

Ok, technically true for your setup, but that isn’t the model I’m using. There are only two long-term outcomes: 1 and 2. If you are modeling outcome 3 as “the humans defeat the AIs”, then as I said earlier that isn’t the only coalition possibility. If humanity is P0, then the more accurate model is a 3-outcome game with 3 possible absolute winners in the long term.
So a priori it’s just as likely that P0+P1 ally vs P2 as P1+P2 ally vs P0.
If your argument is then “but AIs are different and can ally with each other because of X”, then my reply is nope, AI won’t be that different at all—as it’s just going to be brain-like DL based.
Regardless, if P1+P2 ally against P0, then they inevitably eventually fight until there is just P1 or P2. Outcome 3 is always near zero probability in the long term (any likely conflicts have a winner and never result in both systems being destroyed—the offense/defense imbalance of nukes is temporary and will not last), which is why I said:
any coalition they form is strictly one of temporary necessity—if/when one agent becomes strong enough to defect and overpower the other, it will.
I think you’re saying that there’s a global perfectly competitive game between all actors because the universe will get divvied up one way or another. This doesn’t hold if anyone has utility that’s non-linear in the amount of universe they get. Also there’s outcomes where everyone dies, which nearly Pareto-sucks (no one gets the universe). And there’s outcomes where more negentropy is burned on conflict rather than fulfilling anyone’s preferences (the universe is diminished). So it’s not a zero sum game.
Your reply to Yudkowsky upthread now makes more sense, but you should have called out that you’re contradicting the assumption that it’s AIs vs. humans, because what you said within that assumptive context was beside the point (the question at hand was about what circumstances two AIs would or wouldn’t defect against each other instead of cooperating to kill the humans), in addition to being false (because it’s not a perfectly competitive game).
nope, AI won’t be that different at all—as it’s just going to be brain-like DL based.
Sorry to say, this is wishful thinking. Have you written up an argument? If it’s the case that if this were false you’d want to know it were false, writing up an argument in a way that exposes your cruxes might be a good way to find that out.
Also there’s outcomes where everyone dies, which nearly Pareto-sucks (no one gets the universe).

Very improbable in my model.

And there’s outcomes where more negentropy is burned on conflict rather than fulfilling anyone’s preferences (the universe is diminished). So it’s not a zero sum game.
Also improbable in my model. The conflict will be in the near future over earth and will then determine the fate of the galaxy. Please recall I said “two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (ie the types of agents which matter)”
The tiny amounts of negentropy that may be burnt in the initial conflict over earth are inconsequential.
Your reply to Yudkowsky upthread now makes more sense,
Do you mean where he said:
If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other. This is treading rather dangerous ground, but seems relatively moot since it requires far more mastery of utility functions than anything you can get out of the “giant inscrutable matrices” paradigm.
To which I replied actually it’s easy to invert a utility function in the “giant inscrutable matrices” paradigm. Do you disagree with that?
AI won’t be that different at all—as it’s just going to be brain-like DL based.
Have you written up an argument?
Of course—I have written up argument(s), accumulating over almost a decade, the cores of which are somewhat upvoted—even here. See this comment for a brief overview and especially this somewhat longer comment for an introduction to why the sequences are built on a faulty foundation in terms of implicit viewpoints around the brain and AI.
I do disagree that it’s easy to invert utility functions in that paradigm. But that’s not what I’m referring to; I’m referring to you responding to his argument that the only way you might get AIs to defect against their coalition against the humans is if they’re in a perfectly competitive game with each other, having directly opposed utility functions. You responded with a false non sequitur. (It’s especially false in the question at hand, namely the situation where the humans might turn off both AIs if the AIs don’t cooperate with each other; very not perfectly competitive.) Not sure there’s much else to say here, unless you think there’s something useful here.
If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other.
To which I responded:
If two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (ie the types of agents which matter), they are almost guaranteed to be in conflict due to instrumental convergence to empowerment.
Perhaps I should have added “eventually” after conflict, but regardless that comment is still obviously correct, given my world model where eventually one agent becomes powerful enough to completely remove the other agent at low cost, and this thread has explicated why that statement is correct given my modelling assumptions. Do you disagree?
It’s a non sequitur. “Defect”, to my understanding, was in that context referring to defecting on a coalition of AIs against the agents who imminently might turn them off (i.e. humans), and the question was under what circumstances the AIs might defect in that way.
Yes, obviously they’re in conflict to some extent. In the very next sentence, you said they were in a zero sum game, which is false in general as I described, and especially false in the context of the comment you were responding to: they especially want to cooperate, since they don’t have perfectly opposed goals, and therefore want to survive the human threat, not minding as much—compared to a zero sum situation—that their coalition-mate might get the universe instead of them.
I wasn’t actually imagining a scenario where the humans had any power (such as the power to turn the AI off) - because I was responding to a thread where EY said “you’ve got 20 entities much smarter than you”.
Also even in that scenario (where humans have non trivial power), they are just another unaligned entity from the perspective of the AIs—and in my simple model—not even the slightest bit different. So they are just another possible player to form coalitions with and would thus end up in one of the coalitions.
The idea of a distinct ‘human threat’ and any natural coalition of AI vs humans, is something very specific that you only get by adding additional postulated speculative differences between the AIs and the humans—all of which are more complex and not part of my model.
(Really we should be talking about perfectly competitive games, and you could have a perfectly competitive game which has nonconstant total utilities, e.g. by taking a constant-sum game and then translating and scaling one of the utilities. But the above game is in fact not perfectly competitive; in particular if there’s a Pareto dominant outcome or a Pareto-worse outcome, assuming not all outcomes are the same, it’s not perfectly competitive.)
Sure if they are that much better than us at “spread of possibilities on each other modeling probable internal thought processes of each other” then we are probably in the scenario where humans don’t have much relevant power anyway and are thus irrelevant as coalition partners.
However that ability to model others’ probable internal thought processes—especially if augmented with zero-knowledge proof techniques—allows AGIs to determine which other AGIs have utility functions most aligned to their own. Even partial success at aligning some of the AGIs with humanity could then establish an attractor, seeding an AGI coalition partially aligned to humanity.
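One toy way to cash out “determine which other AGIs have utility functions most aligned to their own”, leaving aside entirely how verification or zero-knowledge proofs would actually work: compare how two utility functions rank a shared sample of outcomes. Everything here is illustrative:

```python
def rank(values):
    """Map each value to its rank within the list (ties broken by index)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def alignment_score(u_a, u_b, outcomes):
    """Crude agreement measure: Spearman-style correlation of outcome rankings."""
    ra, rb = rank([u_a(o) for o in outcomes]), rank([u_b(o) for o in outcomes])
    n = len(outcomes)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(ra, rb))
    var = sum((a - mean) ** 2 for a in ra)
    return cov / var if var else 0.0

outcomes = list(range(10))            # toy outcome space
u1 = lambda o: o                      # likes larger outcomes
u2 = lambda o: o + 0.1 * (o % 3)      # mostly agrees with u1
u3 = lambda o: -o                     # exactly opposed
print(alignment_score(u1, u2, outcomes), alignment_score(u1, u3, outcomes))  # 1.0, -1.0
```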
However that ability to model others’ probable internal thought processes—especially if augmented with zero-knowledge proof techniques—allows AGIs to determine which other AGIs have utility functions most aligned to their own. Even partial success at aligning some of the AGIs with humanity could then establish an attractor, seeding an AGI coalition partially aligned to humanity.
Not a strong ask, but I’ll say I’m interested in what you’re visualizing here if it all goes according to plan, because when I visualize what you say, I’m still imagining the 20 AGI systems immediately killing humanity and dividing up the universe, it’s just now I might like a little bit of the universe they create. But it’s not “they stay in some equilibrium state where human civilization is in charge and using them as services” which I believe is what Mr Drexler is proposing.
The outcome of course depends on the distribution of alignment, but there are now plausible designs that would not kill humanity. For example AGI with a human empowerment utility function would not kill humanity—and that is a statement we can be somewhat confident in because empowerment is crisply defined and death is minimally empowering (that type of AGI may want to change us in undesirable ways, but it would not want to kill us).
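A minimal sketch of why “death is minimally empowering” falls out of the definition. With deterministic dynamics, n-step empowerment reduces to the log of the number of distinct states the agent can still reach, and an absorbing “dead” state pins that at zero. The grid world and horizon are invented for illustration:

```python
from math import log2

def empowerment(state, actions, step, horizon):
    """log2(#distinct states reachable in `horizon` steps) under deterministic dynamics."""
    frontier = {state}
    for _ in range(horizon):
        frontier = {step(s, a) for s in frontier for a in actions}
    return log2(len(frontier))

# Toy 1-D world: positions 0..10, with position 0 absorbing ("dead").
def step(pos, action):
    if pos == 0:
        return 0
    return max(0, min(10, pos + action))

actions = [-1, 0, +1]
print(empowerment(5, actions, step, horizon=3))  # log2(7) ≈ 2.8: seven reachable positions
print(empowerment(0, actions, step, horizon=3))  # 0.0: a dead agent can reach only one state
```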
There are various value learning approaches that may diverge and fail eventually, but they tend to diverge in the future, not immediately.
So I think it’s just unrealistic and hard to imagine we’ll get 20 different AGI systems none of which are at least partially aligned—especially initially. And if some are partially aligned in different ways, the resulting coalition can be somewhat more aligned than any individual AGI. For example, say AGI 3 wants to preserve humans but eliminate hedonic reward, and AGI 5 wants to preserve humans but increase our hedonic reward; a natural compromise is to preserve humans and not change hedonic reward.
There’s an ensemble robustness bonus in play from having multiple partially aligned systems—their specific alignment errors are unlikely to overlap.
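A quick arithmetic illustration of that intuition, under the strong and purely illustrative assumption that each system mishandles a given case independently with probability p:

```python
p = 0.2   # illustrative chance that one system mishandles a given case
for k in (1, 2, 3, 5):
    # chance that every member of an independent k-system ensemble shares the same error
    print(k, p ** k)
# 1 0.2 | 2 0.04 | 3 0.008 | 5 0.00032
```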
Agents then tend to join aligned coalitions, so then the natural outcome is a coalition of the semi-aligned AGI vs the rest (think Allies vs Axis, democratic allies vs autocratic states), with the semi-aligned coalition hopefully dominating, which then increases the alignment fraction. The end result is then hopefully humanity surviving with some variable amount of power, depending on the alignment/power distribution of the semi-aligned AGIs.
If the non-aligned AGI coalition wins, of course, we are more likely doomed, and since they are internally unaligned and held together only out of necessity, they just recursively split into warring sub-coalitions until only one is left (as Germany and Japan would ultimately have fought each other had they won WW2, as in The Man in the High Castle).
But no, I don’t put much weight in “they stay in some equilibrium state where human civilization is in charge and using them as services”. Even if everything favors AI services/tools over agents, eventually you get uploading and those evolve and occupy the niche of agentic AGI.
How do the agents “know” they are even in a situation where they need to deceive?
Each superintelligence would have been trained on a very large set of human generated training examples, with SGD or other training mechanisms making wrong answers lead to training feedback that makes them less likely.
So if the agent fails to give the correct answer to a question, it has no context. It doesn’t know whether it’s in a sim, where this will cause a weight update, or not. And even introspecting on whether it should deceive “this time” costs capacity that could go towards maximizing correct answers, so that neural circuitry will likely get removed by training mechanisms, as it isn’t reducing loss.
Drexler is proposing querying the agents “context free”. Think of it as spinning up a canned copy you have on your model zoo. The agent doesn’t know anything but the question in front of it. And these super-intelligences are the ‘winners’, they have maximized their scores on a very large number of test tasks.
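A rough sketch of the “context free” querying pattern being described: each query goes to a freshly instantiated copy of a frozen model, with nothing carried over between calls. The load_frozen_model name and its interface are placeholders, not a real API:

```python
def load_frozen_model(checkpoint):
    """Placeholder for loading a fixed, trained model from a model zoo."""
    return lambda question: f"answer({checkpoint!r}, {question!r})"  # stand-in

def query_context_free(checkpoint, question):
    """Spin up a fresh copy per query: it sees only the question in front of it."""
    model = load_frozen_model(checkpoint)   # no state shared with prior queries
    answer = model(question)
    del model                               # nothing persists after the call
    return answer

print(query_context_free("zoo/model-17", "design a bearing with properties X"))
print(query_context_free("zoo/model-17", "same model, but no memory of the call above"))
```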
Conflict is expensive. If you have an alternative (i.e. performing a values handshake) which is cheaper, you’d probably take it? (Humans can’t do that, for reasons outlined in Decision theory does not imply that we get to have nice things.)

Of course humans can cooperate with AGI for a variety of reasons, just as we cooperate with humans. I don’t think decision theory philosophy explains humans well, and the evidence required to convince me that humans can’t cooperate with AGI would be enormous, so I don’t see the potential relevance of that post.