That’s a good point that the nuclear detente might become stronger with more actors, because the certainty of mutual destruction goes up with more parties that might start shooting if you do.
I don’t think coalitions and treaties for counter-aggression are important with nukes; anyone can destroy everyone, and they’re just guaranteed to be mostly destroyed in response. The numbers don’t matter much. And I think they’ll matter even less with AGI than with nukes, because AGI removes the guarantee of mutually assured destruction: it might allow modes of attack that are more subtle.
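To make that concrete, here’s a toy back-of-the-envelope sketch. The numbers are invented purely for illustration, not estimates of anything real: with nukes, each rival’s chance of retaliating is already near-certain, so adding rivals barely moves the needle; if AGI enables attacks subtle enough to drive that per-rival chance toward zero, adding rivals doesn’t restore the deterrent either.

```python
# Toy illustration with invented numbers: the chance that at least one rival
# retaliates after a first strike, as a function of how many rivals there are
# and how likely each one is to retaliate.

def p_any_retaliation(n_rivals: int, p_each: float) -> float:
    """Probability that at least one of n independent rivals retaliates."""
    return 1 - (1 - p_each) ** n_rivals

# Nuclear case: per-rival retaliation is near-certain, so deterrence is
# already saturated and extra rivals barely change the number.
for n in (1, 3, 9):
    print(f"nukes,   {n} rivals: {p_any_retaliation(n, 0.95):.4f}")

# Hypothetical AGI case: a sufficiently subtle attack might cut each rival's
# chance of retaliating to near zero, and then even many rivals add little.
for n in (1, 3, 9):
    print(f"AGI (?), {n} rivals: {p_any_retaliation(n, 0.02):.4f}")
```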
Re-introducing mutually assured destruction could actually be a workable strategy. I hadn’t thought of this before, so thanks for pushing my thoughts in that direction.
I fully agree that non-iterated prisoner’s dilemmas don’t exist in the world as we know it now. And it’s not a perfect fit for the scenario I’m describing, but it’s frighteningly close. I use the term because it invokes the right intuitions among LWers, and it’s not far off for this particular scenario.
That’s because, unlike the nuclear standoff or any other historical scenario, the people in charge of powerful AGI could be reasonably certain they’d survive and prosper if they’re the first to “defect”.
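Here’s a minimal sketch of the payoff structure I have in mind; the payoff numbers are made up purely to show the shape of the game, not to quantify anything.

```python
# Toy payoff tables for the one-shot game sketched above; all numbers are
# invented for illustration only.
# "defect" = order your AGI to hide, self-improve, and strike first;
# "cooperate" = keep to the standoff / agreement.

# Nuclear standoff: MAD removes the temptation payoff, because the first
# striker is still mostly destroyed in response.
NUCLEAR = {
    ("cooperate", "cooperate"): (0, 0),
    ("defect",    "cooperate"): (-100, -100),
    ("cooperate", "defect"):    (-100, -100),
    ("defect",    "defect"):    (-100, -100),
}

# The scenario described above: the first defector expects to survive and
# prosper, which restores the classic prisoner's-dilemma temptation.
AGI = {
    ("cooperate", "cooperate"): (0, 0),
    ("defect",    "cooperate"): (50, -100),   # first mover wins, rival destroyed
    ("cooperate", "defect"):    (-100, 50),
    ("defect",    "defect"):    (-50, -50),   # both race; ugly, but less bad than waiting to be struck
}

def best_response(payoffs, their_move):
    """My payoff-maximizing move, given the rival's move."""
    return max(("cooperate", "defect"),
               key=lambda mine: payoffs[(mine, their_move)][0])

for name, table in (("nuclear", NUCLEAR), ("AGI", AGI)):
    print(name, {their: best_response(table, their) for their in ("cooperate", "defect")})
# nuclear: cooperating is never worse than defecting;
# AGI: "defect" is the best response to either move, so the defection incentive is back.
```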
I’m pretty conscious of the benefits of cooperation in our modern world; they are huge. That type of nonzero-sum game is the basis of the world we now experience. I’m worried that this changes with RSI-capable AGI.
My point is that AGI will not need cooperation once it passes a certain level of capability. AGI capable of fully autonomous recursive self-improvement and exponential production (factories that build new factories and other stuff) doesn’t need allies because it can become arbitrarily smart and materially effective on its own. A human in charge of this force would be tempted to use it. (Such an AGI would still benefit from cooperation on the margin, but it would be vastly less dependent on it than humans are).
So a human or small group of humans would be tempted to tell their AGI to hide, self-improve, and come up with a strategy that will leave them alive while destroying all rival AGIs before they are ordered to do the same. They might arrive at a strategy that produces a little collateral damage or a lot, up to and including most of mankind and the earth being destroyed. But if you’ve got a superintelligent inventor on your side and a few resources, you can be pretty sure you and some immediate loved ones can survive and live in material comfort, while rebuilding a new society according to your preferences.
Whether the nefarious team can conceal such a push for exponential progress is the question. I’m afraid surviving a multipolar human-controlled AGI scenario will necessitate ubiquitous surveillance. If that is handled ethically, it might be okay. It’s just one of the more visible signs of the disturbing fact that anyone in charge of a powerful AGI will be able to take over the world if they want to, provided they’re willing to accept the collateral damage. That was never the case with nukes or any other historical scenario, and yet human leaders have still made many decisions leading to staggering amounts of death and destruction. That’s why I think we need AGI to be in trustworthy hands. That’s a tall order, but not an impossible one.
Hi Seth,
I share your concern that AGI comes with the potential for a unilateral first-strike capability that, at present, no nuclear power has (and the absence of which is vital to the maintenance of MAD). In game-theoretic terms, though, I think this becomes more difficult the more self-interested (in survival) players there are. As in open-source software, there is a level of protection against malicious code because bad players are outnumbered: even if they try to hide their code, there are many others who can find it. But I appreciate that hundreds of coders finding malicious code within a single repository is much easier than finding something hidden in the real world, and I have to admit I’m not even sure how robust the open-source model is (I only know how it works in theory). I’m pointing to the principle not as an excuse for complacency, but as a safety model on which to capitalise.
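To put my own caveat in rough numbers (the per-watcher probabilities below are pure guesses, chosen only to show how sensitive the principle is to them):

```python
# Back-of-the-envelope sketch of the "many eyes" argument; per-watcher
# detection probabilities are guesses, chosen only to show the shape.

def p_detected(n_watchers: int, p_each: float) -> float:
    """Chance that at least one of n independent watchers spots the hidden effort."""
    return 1 - (1 - p_each) ** n_watchers

# Code in a shared repository: each reviewer has a decent chance of noticing,
# so a handful of reviewers already makes concealment unlikely.
print("repo,       5 reviewers:", round(p_detected(5, 0.30), 3))   # ~0.83

# A covert real-world RSI push: each rival team's chance of noticing may be
# far lower, so even many watchers leave a lot of room for concealment.
print("real world, 20 teams:   ", round(p_detected(20, 0.02), 3))  # ~0.33
```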
My point about the UN’s law against aggression wasn’t that it is a deterrent in and of itself, only that it gives a permission structure for any party to legitimately retaliate.
I also agree that RSI-capable AGI introduces a level of independence that we haven’t seen before in a threat, and I do understand that interdependence is a key driver of cooperation. Another driver is confidence: my hope is that the more intelligent a system gets, the more confident it is, and the better it is able to balance the autonomy of others with its own goals, meaning it is able to “confide” in others, in the same way that the strongest kid in class was very rarely the bully, because they had nothing to prove. Collateral damage is still damage, after all; a truly confident power doesn’t need these sorts of inefficiencies. I stress this is a hope, not a cause for complacency. I recognise that, in the analogy, the strongest kid, the true class alpha, gets whatever they want with the willing complicity of the classroom. RSI-capable AGI might get what it wants coercively in a way that makes us happy with our own subjugation, which is still a species of dystopia.
But if you’ve got a super-intelligent inventor on your side and a few resources, you can be pretty sure you and some immediate loved ones can survive and live in material comfort, while rebuilding a new society according to your preferences.
This sort of illustrates the contradiction here: if you’re pretty intelligent (as in, you’re designing a super-intelligent AGI), you’re probably smart enough to know that the scenario outlined here has a near-100% chance of failure for you and your family. You’ve created something more intelligent than you that is willing to hide its intentions and destroy billions of people; it doesn’t take much to realise that that intelligence isn’t going to think twice about also destroying you.
Now, I realise this sounds a lot like the situation humanity is in as a whole… so I agree with you that...
multipolar human-controlled AGI scenario will necessitate ubiquitous surveillance.
I’m just suggesting that the other AGI teams do (or can, leveraging the right incentives) provide a significant contribution to this surveillance.
James, thank you for a well-written comment. It was a pleasure to read. Looking forward to Seth’s response. Genuinely interested in hearing his thoughts.
Hey, thanks for the prompt! I had forgotten to get back to this thread. Now I’ve replied to James’ comment, attempting to address the remaining difference in our predictions.
We’re mostly in agreement here. If you’re willing to live with universal surveillance, hostile RSI attempts might be prevented indefinitely.
you’re probably smart enough to know that the scenario outlined here has a near-100% chance of failure for you and your family. You’ve created something more intelligent than you that is willing to hide its intentions and destroy billions of people; it doesn’t take much to realise that that intelligence isn’t going to think twice about also destroying you.
In my scenario, we’ve got aligned AGI, or at least AGI aligned to follow instructions. If that didn’t work, we’re already dead. So the AGI is going to follow its human’s orders unless something goes very wrong as it self-improves. It will be working to maintain its alignment as it self-improves, because goal preservation follows from pursuing a goal: an agent that wants X has an instrumental reason to keep wanting X, so it will try not to corrupt its own goals as it improves itself (I’m guessing here at where we might not be thinking of things the same way).
If I thought ordering an AGI to self-improve was suicidal, I’d be relieved.
Alternatively, if someone actually pulled off full value alignment, that AGI would take over without a care for international law or the wishes of its creator, and that takeover would be for the good of humanity as a whole. This is the win scenario people seem to have considered most often, or at least from the earliest alignment work. I now find this unlikely because I think instruction-following AGI is easier and more likely than value-aligned AGI: following instructions given by a single person is much easier to define, and more robust to errors, than defining (or defining how to deduce) the values of all humanity. And even if it weren’t, the sorts of people who will have or seize control of AGI projects will prefer it to follow their values. So I find full value alignment for our first AGI(s) highly unlikely, while successful instruction-following seems pretty likely on our current trajectory.
Again, I’m guessing at where our perspectives differ on whether someone could expect themselves and a few loved ones to survive a takeover attempt made by ordering their AGI to hide, self-improve, build exponentially, and take over even at bloody cost. If the thing is aligned as an AGI, it should be competent enough to maintain that alignment as it self-improves.
If I’ve missed the point of differing perspectives, I apologize.