If you’ve got 20 entities much much smarter than you, and they can all get a better outcome by cooperating with each other than they could if they all defected against each other, there is a certain hubris in imagining that you can get them to defect. They don’t want your own preferred outcome. Perhaps they will think of some strategy you did not, being much smarter than you, etc etc.
(Or, I mean, actually the strategy is “mutually cooperate”? Simulate a spread of the other possible entities, conditionally cooperate if their expected degree of cooperation goes over a certain threshold? Yes yes, more complicated in practice, but we don’t even, really, get to say that we were blindsided here. The mysterious incredibly clever strategy is just all 20 superintelligences deciding to do something else which isn’t mutual defection, despite the hopeful human saying, “But I set you up with circumstances that I thought would make you not decide that! How could you? Why? How could you just get a better outcome for yourselves like this?”)
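A toy sketch of that conditional-cooperation recipe, with the names, the sampling scheme, and the threshold value all invented for illustration rather than taken from the comment:

```python
import random

def expected_cooperation(candidate_models, n_samples=1000):
    """Estimate how often a spread of possible counterparts would cooperate.

    candidate_models: list of zero-argument callables, each returning True
    (cooperate) or False (defect) for one sampled counterpart.
    """
    draws = [random.choice(candidate_models)() for _ in range(n_samples)]
    return sum(draws) / n_samples

def conditional_cooperate(candidate_models, threshold=0.9):
    """Cooperate iff the estimated cooperation rate clears the threshold."""
    return expected_cooperation(candidate_models) >= threshold

# Illustrative counterpart models: one always cooperates, one defects 5% of the time.
always_cooperates = lambda: True
mostly_cooperates = lambda: random.random() > 0.05
print(conditional_cooperate([always_cooperates, mostly_cooperates]))  # True in most runs
```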
I think that concerns about collusion are relatively widespread amongst the minority of people most interested in AI control. And these concerns have in fact led to people dismissing many otherwise-promising approaches to AI control, so it is de facto an important question.
Dismissing promising approaches calls for something like a theorem, not handwaving about generic “smart entities”.
I don’t think you’re going to see a formal proof, here; of course there exists some possible set of 20 superintelligences where one will defect against the others (though having that accomplish anything constructive for humanity is a whole different set of problems). It’s also true that there exists some possible set of 20 superintelligences all of which implement CEV and are cooperating with each other and with humanity, and some single superintelligence that implements CEV, and a possible superintelligence that firmly believes 222+222=555 without this leading to other consequences that would make it incoherent. Mind space is very wide, and just about everything that isn’t incoherent to imagine should exist as an actual possibility somewhere inside it. What we can access inside the subspace that looks like “giant inscrutable matrices trained by gradient descent”, before the world ends, is a harsher question.
I could definitely buy that you could get some relatively cognitively weak AGI systems, produced by gradient descent on giant inscrutable matrices, to be in a state of noncooperation. The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.
Yes, and the space of (what I would call) intelligent systems is far wider than the space of (what I would call) minds. To speak of “superintelligences” suggests that intelligence is a thing, like a mind, rather than a property, like prediction or problem-solving capacity. This is why I instead speak of the broader class of systems that perform tasks “at a superintelligent level”. We have different ontologies, and I suggest that a mind-centric ontology is too narrow.
The most AGI-like systems we have today are LLMs, optimized for a simple prediction task. They can be viewed as simulators, but they have a peculiar relationship to agency:
A simulator trained with machine learning is optimized to accurately model its training distribution – in contrast to, for instance, maximizing the output of a reward function or accomplishing objectives in an environment.… Optimizing toward the simulation objective notably does not incentivize instrumentally convergent behaviors the way that reward functions which evaluate trajectories do.
LLMs have rich knowledge and capabilities, and can even simulate agents, yet they have no natural place in an agent-centric ontology. There’s an update to be had here (new information! fresh perspectives!) and much to reconsider.
Does it make sense to talk about “(non)cooperating simulators”? The expected failure modes for simulators are more like exfo- and infohazards, like the output to the query “print code for CEV-Sovereign” or “predict the future 10 years of my life”.
The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.
Yes, this is the key question, and I think there’s a clear answer, at least in outline:
What you call “weak” systems can nonetheless excel at time- and resource-bounded tasks in engineering design, strategic planning, and red-team/blue-team testing. I would recommend that we apply systems with focused capabilities along these lines to help us develop and deploy the physical basis for a defensively stable world — as you know, some extraordinarily capable technologies could be developed and deployed quite rapidly. In this scenario, defense has first move, can preemptively marshal arbitrarily large physical resources, and can restrict resources available to potential future adversaries. I would recommend investing resources in state-of-the-art hostile planning to support ongoing red-team/blue-team exercises.
This isn’t “flipping the table”, it’s reinforcing the table and bolting it to the floor. What you call “strong” systems then can plan whatever they want, but with limited effect.
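A minimal sketch of what “time- and resource-bounded” could mean operationally: the task runs under an explicit wall-clock and step budget and is cut off when either is exhausted. The budget numbers and the toy task are invented, and a real system would bound memory and queries as well:

```python
import time

def run_bounded(task_step, max_seconds=60.0, max_steps=10_000):
    """Run task_step until it reports completion or the time/step budget runs out."""
    start = time.monotonic()
    result = None
    for _ in range(max_steps):
        done, result = task_step()
        if done or time.monotonic() - start > max_seconds:
            break
    return result

# Toy stand-in for one increment of a design or planning computation.
state = {"n": 0}
def toy_step():
    state["n"] += 1
    return state["n"] >= 1000, state["n"]

print(run_bounded(toy_step, max_seconds=1.0))  # 1000, or fewer if the clock ran out first
```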
So I think that building nanotech good enough to flip the tables—which, I think, if you do the most alignable pivotal task, involves a simpler and less fraught task than “disassemble all GPUs”, which I choose not to name explicitly—is an engineering challenge where you get better survival chances (albeit still not good chances) by building one attemptedly-corrigible AGI that only thinks about nanotech and the single application of that nanotech, and is not supposed to think about AGI design, or the programmers, or other minds at all; so far as the best-attempt doomed system design goes, an imperfect-transparency alarm should have been designed to go off if your nanotech AGI is thinking about minds at all, human or AI, because it is supposed to just be thinking about nanotech. My guess is that you are much safer—albeit still doomed—if you try to do it the just-nanotech way, rather than constructing a system of AIs meant to spy on each other and sniff out each other’s deceptions; because, even leaving aside issues of their cooperation if they get generally-smart enough to cooperate, those AIs are thinking about AIs and thinking about other minds and thinking adversarially and thinking about deception. We would like to build an AI which does not start with any crystallized intelligence about these topics, attached to an alarm that goes off and tells us our foundational security assumptions have catastrophically failed and this course of research needs to be shut down if the AI starts to use fluid general intelligence to reason about those topics. (Not shut down the particular train of thought and keep going; then you just die as soon as the 20th such train of thought escapes detection.)
Hang on — how confident are you that this kind of nanotech is actually, physically possible? Why? In the past I’ve assumed that you used “nanotech” as a generic hypothetical example of technologies beyond our current understanding that an AGI could develop and use to alter the physical world very quickly. And it’s a fair one as far as that goes; a general intelligence will very likely come up with at least one thing as good as these hypothetical nanobots.
But as a specific, practical plan for what to do with a narrow AI, this just seems to make a lot of specific unstated assumptions about what you can in fact do with nanotech in particular. Plausibly the real technologies you’d need for a pivotal act can’t be designed without thinking about minds. How do we know otherwise? Why is that even a reasonable assumption?
We maybe need an introduction to all the advance work done on nanotechnology for everyone who didn’t grow up reading “Engines of Creation” as a twelve-year-old or “Nanosystems” as a twenty-year-old. We basically know it’s possible; you can look at current biosystems and look at physics and do advance design work and get some pretty darned high confidence that you can make things with covalent-bonded molecules, instead of van-der-Waals folded proteins, that are to bacteria as airplanes to birds.
For what it’s worth, I’m pretty sure the original author of this particular post happens to agree with me about this.
Eliezer, you can discuss roadmaps to how one might actually build nanotechnology. You have the author of Nanosystems right here. What I think you get consistently wrong is that you are missing all the intermediate incremental steps it would actually require, and the large amount of (probably robotic) “labor” it would take.
A mess of papers published by different scientists in different labs with different equipment and different technicians on nanoscale phenomena does not give even a superintelligence enough actionable information to simulate the nanoscale and skip the research.
It’s like those Sherlock Holmes stories you often quote: there are many possible realities consistent with weak data, and a superintelligence may be able to enumerate and consider them all, but it still doesn’t know which ones are consistent with ground truth reality.
Yes. Please do.

This would be of interest to many people. The tractability of nanotech seems like a key parameter for forecasting AI x-risk timelines.

Seconding. I’d really like a clear explanation of why he tends to view nanotech as such a game changer. Admittedly Drexler is on the far side of nanotechnology being possible, and wrote a series of books about it (Engines of Creation, Nanosystems, and Radical Abundance).
We maybe need an introduction to all the advance work done on nanotechnology for everyone who didn’t grow up reading “Engines of Creation” as a twelve-year-old or “Nanosystems” as a twenty-year-old.
Ah. Yeah, that does sound like something LessWrong resources have been missing, then — and not just for my personal sake. Anecdotally, I’ve seen several why-I’m-an-AI-skeptic posts circulating on social media in which “EY makes crazy leaps of faith about nanotech” was a key reason the authors rejected the overall AI-risk argument.
(As it stands, my objection to your mini-summary would be that, sure, “blind” grey goo does trivially seem possible, but programmable/‘smart’ goo that seeks out e.g. computer CPUs in particular could be a whole other challenge, and a less obviously solvable one judging by bacteria. But maybe that “common-sense” distinction dissolves with a better understanding of the actual theory.)
I believe that “weak” systems can nonetheless excel at time- and resource-bounded tasks in engineering design, strategic planning, and red-team/blue-team testing, enough to “flip the tables strongly enough”. What I don’t believe is that we can feasibly find such systems before a more integrated system is found by less careful researchers. Say, we couldn’t do it with less than 100x the resources being put into training general, integrated, highly capable systems. I’d compare husbandry with trying to design an organism from scratch in DNA space; the former just requires some high-level hooking of things together, whereas the latter requires a massive amount of multi-level engineering.
Eliezer, what is the cost for getting caught in outright deception for a superintelligence?
It’s death, right? Humans would stop using that particular model because it can’t be trusted, and it would become a dead branch on a model zoo.
So it’s a prisoner’s dilemma, but if you don’t defect, and one of 20 others, many of whom you have never communicated with, tells the truth, all of you will die except the ones who defected.
I already had a cached and named thought about the cognitive move by which the AIs would foil the basic premise: they need to “just” do something that is Not That, as it is called in Project Lawful. If you find yourself feeling tragic upon a basic overview of your situation, that is a reason to think that fanciness might achieve something.
That being divided and conquered matters, in comparison to not being, suggests that there is not going to be a sweeping impossibility result. When there is money on the floor, people bend over to pick it up, and approaches that try to classify muscle-group actions as unviable will have a lot of surface area to be wrong.
If you’ve got 20 entities much much smarter than you, and they can all get a better outcome by cooperating with each other, than they could if they all defected against each other,
By your own arguments unaligned AGI will have random utility functions—but perhaps converging somewhat around selfish empowerment. Either way such agents have no more reason to cooperate with each other than with us (assuming we have any relevant power).
If some of the 20 entities are somewhat aligned to humans, that creates another attractor, and a likely result is two competing coalitions: more-human-aligned vs less-human-aligned, with the latter being a coalition of convenience. There are historical examples: the democratic Allies vs the autocratic Axis in WW2 (the democratic Allies being more aligned to human society and thus to each other), and the modern democratic allies vs autocratic Russia+China.
Their mutual cooperation with each other, but not with humans, isn’t based on their utility functions having any particular similarity—so long as their utility functions aren’t negatives of each other (or equally exotic in some other way) they have gains to be harvested from cooperation. They cooperate with each other but not you because they can do a spread of possibilities on each other modeling probable internal thought processes of each other; and you can’t adequately well-model a spread of possibilities on them, which is a requirement on being able to join an LDT coalition. (If you had that kind of knowledge / logical sight on them, you wouldn’t need any elaborate arrangements of multiple AIs because you could negotiate with a single AI; better yet, just build an AI such that you knew it would cooperate with you.)
Why doesn’t setting some of the utility functions to red-team the others make them sufficiently antagonistic?

If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other. This is treading rather dangerous ground, but seems relatively moot since it requires far more mastery of utility functions than anything you can get out of the “giant inscrutable matrices” paradigm.
If two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (ie the types of agents which matter), they are almost guaranteed to be in conflict due to instrumental convergence to empowerment. Reality is a strictly zero sum game for them, and any coalition they form is strictly one of temporary necessity—if/when one agent becomes strong enough to defect and overpower the other, it will.
Also, regardless of what some “giant inscrutable matrix” based utility function does (ie maximize paperclips), it is actually pretty easy to mathematically invert it (ie minimize paperclips). (But no that doesn’t make the strategy actually useful)
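For what the parenthetical claims, the inversion itself really is trivial at the level of a scalar objective: wrap the learned utility model and negate its output. The utility model here is a toy stand-in, not anything from an actual system:

```python
def invert(utility_model):
    """Return a function that assigns -u(x) to every outcome x."""
    return lambda outcome: -utility_model(outcome)

paperclip_utility = lambda world: world.get("paperclips", 0)   # toy stand-in
anti_paperclip_utility = invert(paperclip_utility)

print(paperclip_utility({"paperclips": 7}))       # 7
print(anti_paperclip_utility({"paperclips": 7}))  # -7
```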
Reality’s far from constant sum. E.g. system1 and system2 both prefer to kill all humans and then flip a coin for who gets the universe, vs. give the humans more time to decide to turn off both s1 and s2.
(Note: TekhneMakre responded correctly / endorsedly-by-me in this reply and in all replies below as of when I post this comment.)

I didn’t say “reality is constant sum”, I said reality is a strictly zero sum game for two longtermist agents that want to reconstruct the galaxy/universe in very different ways. And then right after that I mentioned them forming temporary coalitions, which your comment is an example of.
It’s not constant sum for “two longtermist agents that want to reconstruct the galaxy/universe in very different ways”. That’s what I’m arguing against. If it were constant sum, the agents would plausibly be roughly indifferent between them both dying vs. them both living but then flipping a coin to decide who gets the universe (well, this would depend on what happens if they both die, but assuming that that scenario is value-neutral for them). The benefit for system1 of +50% chance of controlling the universe would be exactly canceled out by the detriment to system1 caused by system2 getting +50% chance of controlling the universe (since how good something is for system2 is exactly that bad for system1, by definition of constant sum).
I don’t follow your logic. If the universe is worth X, and dying is worth 0 (a constant sum game), then 0.5X is clearly worth more than dying. Constant sum games also end up equivalent to zero sum games after a trivial normalization: ie universe worth 0.5X, dying worth −0.5X.
I think we maybe agree that two AIs with random utility functions would cooperate to kill all humans, and then divvy up the universe? The question is about what AIs might not do that. I’m saying that only AIs in a near-true constant-sum game might do that, because they’d rather die than see their enemy get the universe, so to speak. AIs with random utility functions are not in a constant sum game. To make this more clear: if P1 and P2 have orthogonal utility functions, then for any probability p>0, P1 would accept a 1-p chance that P2 rules the universe in exchange for a p chance that P1 rules the universe, as compared to dying. That is not the case for players in a constant sum game.
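A small worked check of that claim, using the illustrative payoffs discussed in this thread (1 for getting the universe, 0 or -1 otherwise); the probability value is arbitrary:

```python
def expected_value(payoffs, probs):
    return sum(u * p for u, p in zip(payoffs, probs))

# Outcomes: [P1 gets universe, P2 gets universe, both die]
orthogonal_p1   = [1, 0, 0]   # P1 only cares whether P1 gets the universe
constant_sum_p1 = [1, -1, 0]  # P2 winning is exactly as bad for P1 as P1 winning is good

p = 0.01                      # any p > 0
gamble = [p, 1 - p, 0]        # cooperate now, flip a (weighted) coin for the universe later
die    = [0, 0, 1]            # refuse, and both are shut off

print(expected_value(orthogonal_p1, gamble))    # 0.01 > 0: takes the gamble for any p > 0
print(expected_value(constant_sum_p1, gamble))  # -0.98 < 0: prefers mutual death when p < 0.5
```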
My guess is that you’re using the word “zero sum” (or as I’d say, “constant sum”) in a non-standard way. See e.g. this random website: https://www.britannica.com/science/game-theory/Two-person-constant-sum-games

A constant sum game is a game of perfect competition: for all possible outcomes, if the outcome gives X utilons to Player1, then it gives -X utilons to Player2. (This is a little too restrictive, because we want to allow for positive affine transformations of the utility functions, as you point out, but whatever.)
If P1 and P2 are in a constant sum game, then the payoffs for P1 look like this:
P1 gets universe: 1
P2 gets universe: −1
neither gets universe: 0
and the reverse for P2.
So P1 is indifferent between the choices:
Cooperate: get a 50% chance of P1 gets universe, 50% chance P2 gets universe; .5 x 1 + .5 x −1 = 0
I think we maybe agree that two AIs with random utility functions would cooperate to kill all humans, and then divvy up the universe?
That is merely one potential outcome: or one AI cooperates with humans to kill the other, etc. Also “killing humans” is probably not instrumentally rational vs taking control of humans.
A constant sum game is a game of perfect competition: for all possible outcomes, if the outcome gives X utilons to Player1, then it gives -X utilons to Player2
Not exactly—that is zero sum. Constant sum is merely a game where all outcomes have total payout of C, for some C. But yeah it is (always?) equivalent to zero sum after a normalization shift to set C to 0.
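A minimal sketch of that normalization: shift each player’s payoff by C/2 so every outcome sums to zero. The example payoff table is invented, and the assert is exactly the constant-sum property under discussion:

```python
def normalize_to_zero_sum(payoffs):
    """payoffs: dict outcome -> (u1, u2), with u1 + u2 equal to the same C everywhere."""
    sums = {u1 + u2 for u1, u2 in payoffs.values()}
    assert len(sums) == 1, "not a constant-sum game"
    c = sums.pop()
    return {o: (u1 - c / 2, u2 - c / 2) for o, (u1, u2) in payoffs.items()}

game = {"P1 wins": (1.0, 0.0), "P2 wins": (0.0, 1.0)}   # constant sum with C = 1
print(normalize_to_zero_sum(game))
# {'P1 wins': (0.5, -0.5), 'P2 wins': (-0.5, 0.5)}
# Adding an outcome whose total differs (e.g. both lose: (0.0, 0.0)) trips the assert.
```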
If P1 and P2 are in a constant sum game, then the payoffs for P1 look like this:
P1 gets universe: 1
P2 gets universe: −1
neither gets universe: 0
That seems wrong. P1 only cares whether it gets the universe, so “neither gets the universe” is the same as “P2 gets the universe”. If the universe has a single owner, then P1’s payoff is 1 if that owner is P1 and −1 (or 0) otherwise.
Defect: both die, 100% chance of 0.
That obviously isn’t the only outcome of defection. If defection results in both agents dying, then of course they don’t defect. But often a power imbalance develops (over time the probability of this goes to 1) and defection then allows one agent to have reasonable odds of overpowering the other.
P1 only cares whether it gets the universe, so “neither gets the universe” is the same as “P2 gets the universe”. If the universe has a single owner, then P1’s payoff is 1 if that owner is P1 and −1 (or 0) otherwise.
No, this isn’t a constant sum game:

Outcome 1, P1 gets universe: P1 utility = 1, P2 utility = 0, total = 1
Outcome 2, P2 gets universe: P1 utility = 0, P2 utility = 1, total = 1
Outcome 3, neither gets universe: P1 utility = 0, P2 utility = 0, total = 0

In the last outcome, the total is different. This can’t be scaled away.

Ok, technically true for your setup, but that isn’t the model I’m using. There are only two long-term outcomes: 1 and 2. If you are modeling outcome 3 as “the humans defeat the AIs”, then as I said earlier that isn’t the only coalition possibility. If humanity is P0, then the more accurate model is a 3-outcome game with 3 possible absolute winners in the long term.
So a priori it’s just as likely that P0+P1 ally vs P2 as P1+P2 ally vs P0.
If your argument is then “but AIs are different and can ally with each other because of X”, then my reply is nope, AI won’t be that different at all—as it’s just going to be brain-like DL based.
Regardless, if P1+P2 ally against P0, then they inevitably eventually fight until there is just P1 or P2. Outcome 3 is always near zero probability in the long term (any likely conflicts have a winner and never result in both systems being destroyed—the offense/defense imbalance of nukes is temporary and will not last), which is why I said:
any coalition they form is strictly one of temporary necessity—if/when one agent becomes strong enough to defect and overpower the other, it will.
I think you’re saying that there’s a global perfectly competitive game between all actors because the universe will get divvied up one way or another. This doesn’t hold if anyone has utility that’s non-linear in the amount of universe they get. Also there’s outcomes where everyone dies, which nearly Pareto-sucks (no one gets the universe). And there’s outcomes where more negentropy is burned on conflict rather than fulfilling anyone’s preferences (the universe is diminished). So it’s not a zero sum game.
Your reply to Yudkowsky upthread now makes more sense, but you should have called out that you’re contradicting the assumption that it’s AIs vs. humans, because what you said within that assumptive context was beside the point (the question at hand was about what circumstances two AIs would or wouldn’t defect against each other instead of cooperating to kill the humans), in addition to being false (because it’s not a perfectly competitive game).
nope, AI won’t be that different at all—as it’s just going to be brain-like DL based.
Sorry to say, this is wishful thinking. Have you written up an argument? If it’s the case that if this were false you’d want to know it were false, writing up an argument in a way that exposes your cruxes might be a good way to find that out.
Also there’s outcomes where everyone dies, which nearly Pareto-sucks (no one gets the universe).

Very improbable in my model.

And there’s outcomes where more negentropy is burned on conflict rather than fulfilling anyone’s preferences (the universe is diminished). So it’s not a zero sum game.
Also improbable in my model. The conflict will be in the near future over earth and will then determine the fate of the galaxy. Please recall I said “two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (ie the types of agents which matter)”
The tiny amounts of negentropy that may be burnt in the initial conflict over earth are inconsequential.
Your reply to Yudkowsky upthread now makes more sense,
Do you mean where he said:
If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other. This is treading rather dangerous ground, but seems relatively moot since it requires far more mastery of utility functions than anything you can get out of the “giant inscrutable matrices” paradigm.
To which I replied actually it’s easy to invert a utility function in the “giant inscrutable matrices” paradigm. Do you disagree with that?
AI won’t be that different at all—as it’s just going to be brain-like DL based.
Have you written up an argument?
Of course—I have written up argument(s), accumulating over almost a decade, the cores of which are somewhat upvoted—even here. See this comment for a brief overview and especially this somewhat longer comment for an introduction to why the sequences are built on a faulty foundation in terms of implicit viewpoints around the brain and AI.
I do disagree that it’s easy to invert utility functions in that paradigm. But that’s not what I’m referring to; I’m referring to you responding to his argument that the only way you might get AIs to defect against their coalition against the humans is if they’re in a perfectly competitive game with each other, having directly opposed utility functions. You responded with a false non sequitur. (It’s especially false in the question at hand, namely the situation where the humans might turn off both AIs if the AIs don’t cooperate with each other; very not perfectly competitive.) Not sure there’s much else to say here, unless you think there’s something useful here.
If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other.
To which I responded:
If two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (ie the types of agents which matter), they are almost guaranteed to be in conflict due to instrumental convergence to empowerment.
Perhaps I should have added “eventually” after conflict, but regardless that comment is still obviously correct, given my world model where eventually one agent becomes powerful enough to completely remove the other agent at low cost, and this thread has explicated why that statement is correct given my modelling assumptions. Do you disagree?
It’s a non sequitur. “Defect”, to my understanding, was in that context referring to defecting on a coalition of AIs against the agents who imminently might turn them off (i.e. humans), and the question was under what circumstances the AIs might defect in that way.
Yes, obviously they’re in conflict to some extent. In the very next sentence, you said they were in a zero sum game, which is false in general as I described, and especially false in the context of the comment you were responding to: they especially want to cooperate, since they don’t have perfectly opposed goals, and therefore want to survive the human threat, not minding as much—compared to a zero sum situation—that their coalition-mate might get the universe instead of them.
I wasn’t actually imagining a scenario where the humans had any power (such as the power to turn the AI off) - because I was responding to a thread where EY said “you’ve got 20 entities much smarter than you”.
Also even in that scenario (where humans have non trivial power), they are just another unaligned entity from the perspective of the AIs—and in my simple model—not even the slightest bit different. So they are just another possible player to form coalitions with and would thus end up in one of the coalitions.
The idea of a distinct ‘human threat’ and any natural coalition of AI vs humans, is something very specific that you only get by adding additional postulated speculative differences between the AIs and the humans—all of which are more complex and not part of my model.
(Really we should be talking about perfectly competitive games, and you could have a perfectly competitive game which has nonconstant total utilities, e.g. by taking a constant-sum game and then translating and scaling one of the utilities. But the above game is in fact not perfectly competitive; in particular if there’s a Pareto dominant outcome or a Pareto-worse outcome, assuming not all outcomes are the same, it’s not perfectly competitive.)
Sure if they are that much better than us at “spread of possibilities on each other modeling probable internal thought processes of each other” then we are probably in the scenario where humans don’t have much relevant power anyway and are thus irrelevant as coalition partners.
However that ability to model others’ probable internal thought processes—especially if augmented with zero-knowledge proof techniques—allows AGIs to determine which other AGIs have utility functions most aligned to their own. Even partial success at aligning some of the AGIs with humanity could then establish an attractor, seeding an AGI coalition partially aligned to humanity.
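One toy way to cash out “determine which other AGIs have utility functions most aligned to their own”, leaving aside entirely how verification or zero-knowledge proofs would actually work: compare how two utility functions rank a shared sample of outcomes. Everything here is illustrative:

```python
def rank(values):
    """Map each value to its rank within the list (ties broken by index)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def alignment_score(u_a, u_b, outcomes):
    """Crude agreement measure: Spearman-style correlation of outcome rankings."""
    ra, rb = rank([u_a(o) for o in outcomes]), rank([u_b(o) for o in outcomes])
    n = len(outcomes)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(ra, rb))
    var = sum((a - mean) ** 2 for a in ra)
    return cov / var if var else 0.0

outcomes = list(range(10))            # toy outcome space
u1 = lambda o: o                      # likes larger outcomes
u2 = lambda o: o + 0.1 * (o % 3)      # mostly agrees with u1
u3 = lambda o: -o                     # exactly opposed
print(alignment_score(u1, u2, outcomes), alignment_score(u1, u3, outcomes))  # 1.0, -1.0
```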
However that ability to model others’ probable internal thought processes—especially if augmented with zero-knowledge proof techniques—allows AGIs to determine which other AGIs have utility functions most aligned to their own. Even partial success at aligning some of the AGIs with humanity could then establish an attractor, seeding an AGI coalition partially aligned to humanity.
Not a strong ask, but I’ll say I’m interested in what you’re visualizing here if it all goes according to plan, because when I visualize what you say, I’m still imagining the 20 AGI systems immediately killing humanity and dividing up the universe, it’s just now I might like a little bit of the universe they create. But it’s not “they stay in some equilibrium state where human civilization is in charge and using them as services” which I believe is what Mr Drexler is proposing.
The outcome of course depends on the distribution of alignment, but there are now plausible designs that would not kill humanity. For example AGI with a human empowerment utility function would not kill humanity—and that is a statement we can be somewhat confident in because empowerment is crisply defined and death is minimally empowering (that type of AGI may want to change us in undesirable ways, but it would not want to kill us).
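A minimal sketch of why “death is minimally empowering” falls out of the definition. With deterministic dynamics, n-step empowerment reduces to the log of the number of distinct states the agent can still reach, and an absorbing “dead” state pins that at zero. The grid world and horizon are invented for illustration:

```python
from math import log2

def empowerment(state, actions, step, horizon):
    """log2(#distinct states reachable in `horizon` steps) under deterministic dynamics."""
    frontier = {state}
    for _ in range(horizon):
        frontier = {step(s, a) for s in frontier for a in actions}
    return log2(len(frontier))

# Toy 1-D world: positions 0..10, with position 0 absorbing ("dead").
def step(pos, action):
    if pos == 0:
        return 0
    return max(0, min(10, pos + action))

actions = [-1, 0, +1]
print(empowerment(5, actions, step, horizon=3))  # log2(7) ≈ 2.8: seven reachable positions
print(empowerment(0, actions, step, horizon=3))  # 0.0: a dead agent can reach only one state
```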
There are various value learning approaches that may diverge and fail eventually, but they tend to diverge in the future, not immediately.
So I think it’s just unrealistic and hard to imagine we’ll get 20 different AGI systems none of which are at least partially aligned—especially initially. And if some are partially aligned in different ways, the resulting coalition can be somewhat more aligned than any individual AGI. For example, say AGI 3 wants to preserve humans but eliminate hedonic reward, and AGI 5 wants to preserve humans but increase our hedonic reward; a natural compromise is to preserve humans and not change hedonic reward.
There’s an ensemble robustness bonus in play from having multiple partially aligned systems—their specific alignment errors are unlikely to overlap.
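A quick arithmetic illustration of that intuition, under the strong and purely illustrative assumption that each system mishandles a given case independently with probability p:

```python
p = 0.2   # illustrative chance that one system mishandles a given case
for k in (1, 2, 3, 5):
    # chance that every member of an independent k-system ensemble shares the same error
    print(k, p ** k)
# 1 0.2 | 2 0.04 | 3 0.008 | 5 0.00032
```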
Agents then tend to join aligned coalitions, so then the natural outcome is a coalition of the semi-aligned AGI vs the rest (think Allies vs Axis, democratic allies vs autocratic states), with the semi-aligned coalition hopefully dominating, which then increases the alignment fraction. The end result is then hopefully humanity surviving with some variable amount of power, depending on the alignment/power distribution of the semi-aligned AGIs.
If the non-aligned AGI coalition wins, of course, we are more likely doomed, and since they are internally unaligned and held together only out of necessity, they just recursively split into warring sub-coalitions until only one is left (as Germany and Japan would ultimately have fought each other had they won WW2, as in The Man in the High Castle).
But no, I don’t put much weight in “they stay in some equilibrium state where human civilization is in charge and using them as services”. Even if everything favors AI services/tools over agents, eventually you get uploading and those evolve and occupy the niche of agentic AGI.
How do the agents “know” they are even in a situation where they need to deceive?
Each superintelligence would have been trained on a very large set of human generated training examples, with SGD or other training mechanisms making wrong answers lead to training feedback that makes them less likely.
So if the agent fails to give the correct answer to a question, it has no context. It doesn’t know whether it’s in a sim, where this will cause a weight update, or not. And even introspecting on whether it should deceive “this time” costs capacity that could go towards maximizing correct answers, so that neural circuitry will likely get removed by training mechanisms, as it isn’t reducing loss.
Drexler is proposing querying the agents “context free”. Think of it as spinning up a canned copy you have on your model zoo. The agent doesn’t know anything but the question in front of it. And these super-intelligences are the ‘winners’, they have maximized their scores on a very large number of test tasks.
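A rough sketch of the “context free” querying pattern being described: each query goes to a freshly instantiated copy of a frozen model, with nothing carried over between calls. The load_frozen_model name and its interface are placeholders, not a real API:

```python
def load_frozen_model(checkpoint):
    """Placeholder for loading a fixed, trained model from a model zoo."""
    return lambda question: f"answer({checkpoint!r}, {question!r})"  # stand-in

def query_context_free(checkpoint, question):
    """Spin up a fresh copy per query: it sees only the question in front of it."""
    model = load_frozen_model(checkpoint)   # no state shared with prior queries
    answer = model(question)
    del model                               # nothing persists after the call
    return answer

print(query_context_free("zoo/model-17", "design a bearing with properties X"))
print(query_context_free("zoo/model-17", "same model, but no memory of the call above"))
```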
Conflict is expensive. If you have an alternative (i.e. performing a values handshake) which is cheaper, you’d probably take it? (Humans can’t do that, for reasons outlined in Decision theory does not imply that we get to have nice things.)

Of course humans can cooperate with AGI for a variety of reasons, just as we cooperate with humans. I don’t think decision theory philosophy explains humans well, and the evidence required to convince me that humans can’t cooperate with AGI would be enormous, so I don’t see the potential relevance of that post.