The nuclear MAD standoff with nonproliferation agreements is fairly similar to the scenario I’ve described. We’ve survived that so far, but with only nine participants to date.
I wonder if there’s a clue in this. When you say “only” nine participants, it suggests that more would introduce more risk, but that’s not what we’ve seen with MAD. The greater the number becomes, the bigger the deterrent gets. If, for a minute, we forgo alliances, there is a natural alliance of “everyone else” at play when it comes to an aggressor. Military aggression is, after all, illegal. So, the greater the number of players, the smaller the advantage any one aggressive player has against the natural coalition of all other peaceful players. If we take alliances into account, then this simply returns to a more binary question and the number of players makes no difference.
So, what happens if we apply this to an AGI scenario?
First, I want to admit I’m immediately skeptical when anyone mentions a non-iterated Prisoner’s Dilemma playing out in the real world, because a Prisoner’s Dilemma requires extremely confined parameters and ignores externalities that are present even in an actual prisoner’s dilemma (between two actual prisoners) in the real world. The world is a continuous game, and as such almost all games are iterated games.
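To make the distinction concrete, here is a toy sketch of my own (using the standard textbook payoffs, nothing specific to the AGI case): in a single round, defecting against a cooperator pays best, but against a tit-for-tat opponent in a repeated game the persistent defector falls well behind.

```python
# Toy illustration: one-shot defection pays, persistent defection doesn't.
# Standard Prisoner's Dilemma payoffs: T=5 (temptation), R=3 (reward),
# P=1 (punishment), S=0 (sucker). The numbers are illustrative only.

PAYOFFS = {  # (my move, their move) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def total_payoff(my_strategy, rounds=20):
    """Play `rounds` rounds against a tit-for-tat opponent and sum my payoff."""
    their_move = "C"  # tit-for-tat opens by cooperating
    total = 0
    for _ in range(rounds):
        my_move = my_strategy(their_move)
        total += PAYOFFS[(my_move, their_move)]
        their_move = my_move  # tit-for-tat copies my last move
    return total

always_defect = lambda _: "D"
always_cooperate = lambda _: "C"

print(total_payoff(always_defect))     # 5 + 19*1 = 24
print(total_payoff(always_cooperate))  # 20*3 = 60
```

The point isn’t the particular numbers, just that repetition changes which strategy wins.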
If we take the AGI situation, we have an increasing number of players (as you mention, “and N increasing”): different AGIs, different human teams, and mixtures of AGI and human teams, all of which want to survive, some of which may want to dominate or eliminate all other teams. There is a natural coalition of teams that want to survive and don’t want to eliminate all other teams, and that coalition will always be larger and more distributed than the nefarious team that seeks to destroy them. We can observe such robustness in many distributed systems that seem, on the face of it, vulnerable. This dynamic makes it increasingly difficult for the nefarious team to hide their activities, while the others are able to capitalise on the benefits of cooperation.
I think we discount the benefit of cooperation because it’s so ubiquitous in our modern world. This ubiquity of cooperation is a product of a tendency in intelligent systems to evolve toward greater non-zero-sumness. While I share many reservations about AGI, when I remember this fact I am somewhat reassured that, as our capability to destroy everything gets greater, that capacity is born of our greater interconnectedness. It is our intelligence and rationality that allow us to harness the benefits of greater cooperation. So I don’t see why greater rationality on the part of AGI should suddenly reverse this trend.
I don’t want to suggest that this is a non-problem; rather, that an acknowledgement of these advantages might allow us to capitalise on them.
That’s a good point that the nuclear detente might become stronger with more actors, because the certainty of mutual destruction goes up with more parties that might start shooting if you do.
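To put that intuition in rough numbers (a back-of-the-envelope framing of my own, not anything from the deterrence literature): if each of the other $N-1$ armed parties independently retaliates against a first strike with probability $p$, the chance the aggressor escapes retaliation entirely is

$$\Pr(\text{no retaliation}) = (1 - p)^{N-1},$$

which shrinks toward zero as $N$ grows, so the expected cost of striking first rises with the number of parties.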
I don’t think the coalition and treaties for counter-aggression are important with nukes; anyone can destroy everyone, and they’re just guaranteed to be mostly destroyed in response. The numbers don’t matter much. And I think they’ll matter even less with AGI than nukes—without the guarantee of mutually assured destruction, since AGI might allow for modes of attack that are more subtle.
Re-introducing mutually assured destruction could actually be a workable strategy. I haven’t thought of this before, so thanks for pushing my thoughts in that direction.
I fully agree that non-iterated prisoner’s dilemmas don’t exist in the world as we know it now. And it’s not a perfect fit for the scenario I’m describing, but it’s frighteningly close. I use the term because it invokes the right intuitions among LWers, and it’s not far off for this particular scenario.
That’s because, unlike the nuclear standoff or any other historical scenario, the people in charge of powerful AGI could be reasonably certain they’d survive and prosper if they’re the first to “defect”.
I’m pretty conscious of the benefits of cooperation in our modern world; they are huge. That type of non-zero-sum game is the basis of the world we now experience. I’m worried that this changes with RSI-capable AGI.
My point is that AGI will not need cooperation once it passes a certain level of capability. AGI capable of fully autonomous recursive self-improvement and exponential production (factories that build new factories and other stuff) doesn’t need allies because it can become arbitrarily smart and materially effective on its own. A human in charge of this force would be tempted to use it. (Such an AGI would still benefit from cooperation on the margin, but it would be vastly less dependent on it than humans are.)
So a human or small group of humans would be tempted to tell their AGI to hide, self-improve, and come up with a strategy that will leave them alive while destroying all rival AGIs before they are ordered to do the same. They might arrive at a strategy that produces a little collateral damage or a lot—like most of mankind and the earth being destroyed. But if you’ve got a superintelligent inventor on your side and a few resources, you can be pretty sure you and some immediate loved ones can survive and live in material comfort, while rebuilding a new society according to your preferences.
Whether the nefarious team can conceal such a push for exponential progress is the question. I’m afraid surviving a multipolar human-controlled AGI scenario will necessitate ubiquitous surveillance. If that is handled ethically, that might be okay. It’s just one of the more visible signs of the disturbing fact that anyone in charge of a powerful AGI will be able to take over the world if they want to—if they’re willing to accept the collateral damage. That is not the case with nukes or any historical scenario—yet human leaders have made many decisions leading to staggering amounts of death and destruction. That’s why I think we need AGI to be in trustworthy hands. That’s a tall order but not an impossible one.
Hi Seth,
I share your concern that AGI comes with the potential for a unilateral first-strike capability that, at present, no nuclear power has (the absence of which is vital to the maintenance of MAD), though I think, in game-theoretic terms, this becomes more difficult the more self-interested (in survival) players there are. As in open-source software, there is a level of protection against malicious code because bad players are outnumbered: even if they try to hide their code, there are many others who can find it. But I appreciate that hundreds of coders finding malicious code within a single repository is much easier than finding something hidden in the real world, and I have to admit I’m not even sure how robust the open-source model is (I only know how it works in theory). I’m pointing to the principle not as an excuse for complacency but as a safety model on which to capitalise.
My point about the UN’s law against aggression wasn’t that it is a deterrent in and of itself, only that it gives a permission structure for any party to legitimately retaliate.
I also agree that RSI-capable AGI introduces a level of independence that we haven’t seen before in a threat. And I do understand that inter-dependence is a key driver of cooperation. Another driver is confidence: my hope is that the more intelligent a system gets, the more confident it is and the better it is able to balance the autonomy of others with its goals, meaning it is able to “confide” in others—in the same way that the strongest kid in class was very rarely the bully, because they had nothing to prove. Collateral damage is still damage, after all; a truly confident power doesn’t need these sorts of inefficiencies. I stress this is a hope, and not a cause for complacency. I recognise that, in this analogy, the strongest kid, the true class alpha, gets whatever they want with the willing complicity of the classroom. RSI-capable AGI might get what it wants coercively in a way that makes us happy with our own subjugation, which is still a species of dystopia.
But if you’ve got a super-intelligent inventor on your side and a few resources, you can be pretty sure you and some immediate loved ones can survive and live in material comfort, while rebuilding a new society according to your preferences.
This sort of illustrates the contradiction here: if you’re pretty intelligent (as in, you’re designing a super-intelligent AGI), you’re probably smart enough to know that the scenario outlined here has a near-100% chance of failure for you and your family. You’ve created something more intelligent than you that is willing to hide its intentions and destroy billions of people; it doesn’t take much to realise that that intelligence isn’t going to think twice about also destroying you.
Now, I realise this sounds a lot like the situation humanity is in as a whole… so I agree with you that...
multipolar human-controlled AGI scenario will necessitate ubiquitous surveillance.
I’m just suggesting that the other AGI teams do (or can, leveraging the right incentives) provide a significant contribution to this surveillance.
James, thank you for a well-written comment. It was a pleasure to read. Looking forward to Seth’s response. Genuinely interested in hearing his thoughts.
Hey, thanks for the prompt! I had forgotten to get back to this thread. Now I’ve replied to James’ comment, attempting to address the remaining difference in our predictions.
We’re mostly in agreement here. If you’re willing to live with universal surveillance, hostile RSI attempts might be prevented indefinitely.
you’re probably smart enough to know that the scenario outlined here has a near 100% chance of failure for you and your family, because you’ve created something more intelligent than you that is willing to hide its intentions and destroy billions of people, it doesn’t take much to realise that that intelligence isn’t going to think twice about also destroying you.
In my scenario, we’ve got aligned AGI—or at least AGI aligned to follow instructions. If that doesn’t work, we’re already dead. So the AGI is going to follow its human’s orders unless something goes very wrong as it self-improves. It will be working to maintain its alignment as it does so, because preserving a goal is implied by instrumentally pursuing that goal. (I’m guessing here at where we might not be thinking of things the same way.)
If I thought ordering an AGI to self-improve was suicidal, I’d be relieved.
Alternatively, if someone actually pulled off full value alignment, that AGI would take over without a care for international law or the wishes of its creator—and that takeover would be for the good of humanity as a whole. This is the win scenario people seem to have considered most often, or at least since the earliest alignment work. I now find this unlikely because I think instruction-following AGI is easier and more likely than value-aligned AGI—following instructions given by a single person is much easier to define, and more robust to errors, than defining (or defining how to deduce) the values of all humanity. And even if it weren’t, the sorts of people who will have or seize control of AGI projects will prefer it to follow their values. So I find full value alignment for our first AGI(s) highly unlikely, while successful instruction-following seems pretty likely on our current trajectory.
Again, I’m guessing at where our perspectives differ on whether someone could expect themselves and a few loved ones to survive a takeover attempt made by ordering their AGI to hide, self-improve, build exponentially, and take over even at bloody cost. If the thing is aligned as an AGI, it should be competent enough to maintain that alignment as it self-improves.
If I’ve missed the point of differing perspectives, I apologize.