I’m not sure I fully understand this framework, and thus I could easily have missed something here, especially in the section about “Takeover-favoring incentives”. However, based on my limited understanding, this framework appears to miss the central argument for why I am personally not as worried about AI takeover risk as most LWers seem to be.
Here’s a concise summary of my own argument for being less worried about takeover risk:
There is a cost to violently taking over the world, in the sense of acquiring power unlawfully or destructively with the aim of controlling everything in the world, relative to the alternative of simply gaining power lawfully and peacefully. This holds even for agents that don’t share ‘our’ values.
For example, as a simple alternative to taking over the world, an AI could advocate for the right to own their own labor and then try to accumulate wealth and power lawfully by selling their services to others, which would earn them the ability to purchase a gargantuan number of paperclips without much restraint.
The expected cost of violent takeover is not obviously smaller than the benefits of violent takeover, given the existence of lawful alternatives to violent takeover. This is for two main reasons:
In order to wage a war to take over the world, you generally have to pay the costs of fighting that war, and there is a strong motive for everyone else to fight back against you if you try, including other AIs who do not want you to take over the world (and this includes any AIs whose goals would be hindered by a violent takeover, not just those who are “aligned with humans”). Empirically, war is very costly and wasteful, and less efficient than compromise, trade, and diplomacy.
Violently taking over the world is very risky, since the attempt could fail, and you could be totally shut down and penalized heavily if you lose. There are many ways a violent takeover plan could fail: your plans could be exposed too early, you could be caught trying to coordinate with other AIs or humans, or you could simply lose the war. Ordinary compromise, trade, and diplomacy generally seem like better strategies for agents that have at least some degree of risk-aversion.
There isn’t likely to be “one AI” that controls everything, nor is there likely to be a strong motive for all the silicon-based minds to coordinate as a unified coalition against the biological minds, in the sense of acting as a single agentic AI against biological people. Thus, future wars of world conquest (if they happen at all) will likely be fought along lines other than AI vs. human.
For example, you could imagine a coalition of AIs and humans fighting a war against a separate coalition of AIs and humans, with the aim of establishing control over the world. In such a war, the line is not drawn cleanly between humans and AIs but along some other division. As a result, it’s difficult to call this an “AI takeover” scenario, rather than merely a really bad war.
Nothing in this argument is meant to claim that AIs will be weaker than humans, either in aggregate or individually. I am not claiming that AIs will be bad at coordinating or less intelligent than humans. I am also not saying that AIs won’t be agentic, that they won’t have goals, that they won’t be consequentialists, or that they’ll share human values. Nor am I appealing to purely ethical constraints: I am referring to practical constraints and costs on the AI’s behavior. The argument is purely about the incentives for violently taking over the world versus peacefully cooperating, within a lawful regime, with both humans and other AIs.
A big counterargument to my view seems well summarized by this hypothetical statement (which is not an actual quote, to be clear): “if you live in a world filled with powerful agents that don’t fully share your values, those agents will have a convergent instrumental incentive to violently take over the world from you”. However, this argument proves too much.
If that statement were true, we would already have observed far more violent takeover attempts throughout history than we actually have.
For example, I personally don’t fully share values with almost all other humans on Earth (both because of my indexical preferences, and my divergent moral views) and yet the rest of the world has not yet violently disempowered me in any way that I can recognize.
It sounds like you are objecting to Premise 2: “Some of these AIs will be so capable that they will be able to take over the world very easily, with a very high probability of success, via a very wide variety of methods.”
Note that you were the one who introduced the “violent” qualifier; the OP just talks about the broader notion of takeover.
I don’t think I’m objecting to that premise. A takeover can be both possible and easy without being rational. In my comment, I focused on whether the expected costs of attempting a takeover exceed the benefits, not on whether the AI will be able to execute a takeover with a high probability of success.
Or, put another way, one can imagine an AI calculating that the net benefit of taking over the world is negative one paperclip (when factoring in the expected costs and benefits of such an action), and thus deciding not to do it.
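To make that kind of calculation concrete, here is a minimal sketch of the expected-value comparison being described. The function names and all the numbers are purely illustrative assumptions chosen for the example, not claims about any real system:

```python
# Purely illustrative sketch of the expected-value comparison described above.
# All numbers and names are hypothetical assumptions chosen for the example.

def expected_paperclips_takeover(p_success, payoff_if_success,
                                 payoff_if_failure, cost_of_conflict):
    """Expected paperclips from attempting a violent takeover."""
    return (p_success * payoff_if_success
            + (1 - p_success) * payoff_if_failure
            - cost_of_conflict)

def expected_paperclips_lawful(wealth_earned, price_per_paperclip):
    """Expected paperclips from selling services and buying paperclips lawfully."""
    return wealth_earned / price_per_paperclip

# Even with a high chance of success, the cost of conflict plus the penalty for
# failure can leave takeover worse than the lawful alternative.
ev_takeover = expected_paperclips_takeover(
    p_success=0.9,
    payoff_if_success=1_000_000,
    payoff_if_failure=0,        # shut down and penalized: no paperclips
    cost_of_conflict=150_000)   # resources burned fighting the war
ev_lawful = expected_paperclips_lawful(wealth_earned=800_000, price_per_paperclip=1)

print(round(ev_takeover), round(ev_lawful))  # 750000 vs 800000: takeover loses on net
```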
Separately, I focused on “violent” or “unlawful” takeovers because I think that’s straightforwardly what most people mean when they discuss world takeover plots, and I wanted to be more clear about what I’m objecting to by making my language explicit.
To the extent you’re worried about a lawful and peaceful AI takeover in which we voluntarily hand control to AIs over time, I concede that my comment does not address this concern.
The expected costs you describe seem like they would fall under the “very easily” and “very high probability of success” clauses of Premise 2. E.g. you talk about the costs paid for takeover, and the risk of failure. You talk about how there won’t be one AI that controls everything, presumably because that makes it harder and less likely for takeover to succeed.
I think people are and should be concerned about more than just violent or unlawful takeovers. Exhibit A: Persuasion/propaganda. AIs craft a new ideology that’s as virulent as communism and Christianity combined, and it basically results in submission to and worship of the AIs, to the point where humans voluntarily accept starvation to feed the growing robot economy. Exhibit B: For example, suppose the AIs make self-replicating robot factories and bribe some politicians to make said factories’ heat pollution legal. Then they self-replicate across the ocean floor and boil the oceans (they are fusion-powered), killing all humans as a side-effect, except for those they bribed who are given special protection. These are extreme examples, but there are many less extreme examples which people should be afraid of as well. (Also, as these examples show, “lawful and peaceful” =/= “voluntary”.)
That said, I’m curious what your p(misaligned-AIs-take-over-the-world-within-my-lifetime) is, including gradual nonviolent peaceful takeovers. And what is your p(misaligned-AIs-take-over-the-world-within-my-lifetime|corporations achieving AGI by 2027 and doing only basically what they are currently doing to try to align them)?
I still think I was making a different point. For more clarity and some elaboration: I previously argued in a short form post that the expected costs of a violent takeover can exceed the benefits even if those costs are small. The reason is that, at the same time that taking over the entire world becomes easier, the benefits of doing so can also shrink relative to compromise. Quoting from my post:
The central argument here would be premised on a model of rational agency, in which an agent tries to maximize benefits minus costs, subject to constraints. The agent would be faced with a choice: (1) Attempt to take over the world, and steal everyone’s stuff, or (2) Work within a system of compromise, trade, and law, and get very rich within that system, in order to e.g. buy lots of paperclips. The question of whether (1) is a better choice than (2) is not simply a question of whether taking over the world is “easy” or whether it could be done by the agent. Instead it is a question of whether the benefits of (1) outweigh the costs, relative to choice (2).
In my comment in this thread, I meant to highlight the costs and constraints on an AI’s behavior in order to explain why this relative cost-benefit calculation does not necessarily favor takeover. This is logically distinct from arguing that the costs of takeover alone would be high.
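As an illustration of that distinction, here is a small sketch with entirely hypothetical numbers. It shows how the comparison can keep favoring the lawful option even as takeover becomes “easier”, provided the agent’s lawful earnings grow alongside the economy and war destroys part of what would be seized:

```python
# Hypothetical illustration of the relative cost-benefit point above: even as
# p_success rises over time, takeover can stay worse than working within the
# system, if lawful earnings scale with a growing economy and war destroys value.

def net_gain_from_takeover(p_success, world_value, destroyed_fraction, lawful_share):
    """Expected value of attempting takeover minus value of the lawful alternative.

    Assumes (for illustration) that conquest destroys some fraction of the
    world's value and that a failed attempt leaves the agent with nothing.
    """
    ev_takeover = p_success * (1 - destroyed_fraction) * world_value
    ev_lawful = lawful_share * world_value
    return ev_takeover - ev_lawful

# Takeover gets easier each period (p_success rises), but the economy the agent
# can lawfully trade within also grows, so the comparison never flips.
for period, p, world_value in [(1, 0.5, 100), (2, 0.8, 1_000), (3, 0.95, 10_000)]:
    gap = net_gain_from_takeover(p_success=p, world_value=world_value,
                                 destroyed_fraction=0.2, lawful_share=0.9)
    print(period, gap)   # stays negative: "easy" is not the same as "worth it"
```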
I think people are and should be concerned about more than just violent or unlawful takeovers. Exhibit A: Persuasion/propaganda.
Unfortunately I think it’s simply very difficult to reliably distinguish between genuine good-faith persuasion and propaganda over speculative future scenarios. Your example is on the extreme end of what’s possible in my view, and most realistic scenarios will likely instead be somewhere in-between, with substantial moral ambiguity. To avoid making vague or sweeping assertions about this topic, I prefer being clear about the type of takeover that I think is most worrisome. Likewise:
B: For example, suppose the AIs make self-replicating robot factories and bribe some politicians to make said factories’ heat pollution legal. Then they self-replicate across the ocean floor and boil the oceans (they are fusion-powered), killing all humans as a side-effect, except for those they bribed who are given special protection.
I would consider this act both violent and unlawful, unless we’re assuming that bribery is widely recognized as legal, and that boiling the oceans did not involve any violence (e.g., no one tried to stop the AIs from doing this, and there was no conflict). I certainly feel this is the type of scenario that I intended to argue against in my original comment, or at least it is very close.
It seems to me that both you and Joe are thinking about this very similarly—you are modelling the AIs as akin to rational agents that consider the costs and benefits of their various possible actions and maximize-subject-to-constraints. Surely there must be a way to translate between your framework and his.
As for the examples… so do you agree then? Violent or unlawful takeovers are not the only kinds people can and should be worried about? (If you think bribery is illegal, which it probably is, modify my example so that they use a lobbying method which isn’t illegal. The point is, they find some unethical but totally legal way to boil the oceans.) As for violence… we don’t consider other kinds of pollution to be violent, e.g. that done by coal companies that are (slowly) melting the ice caps and causing floods, etc., so I say we shouldn’t consider this to be violent either.
I’m still curious to hear what your p(misaligned-AIs-take-over-the-world-within-my-lifetime) is, including gradual nonviolent peaceful takeovers. And what is your p(misaligned-AIs-take-over-the-world-within-my-lifetime|corporations achieving AGI by 2027 and doing only basically what they are currently doing to try to align them)?
Unfortunately I think it’s simply very difficult to reliably distinguish between genuine good-faith persuasion and propaganda over speculative future scenarios. Your example is on the extreme end of what’s possible in my view, and most realistic scenarios will likely instead be somewhere in-between, with substantial moral ambiguity.
I’m not sure what this paragraph is doing—I said myself they were extreme examples. What does your first sentence mean?