James Stephen Brown comments on If we solve alignment, do we die anyway?

James Stephen Brown 5 Nov 2024 17:56 UTC
1 point
0
Hi Seth,
I share your concern that AGI comes with the potential for a unilateral first-strike capability that, at present, no nuclear power has (the absence of which is vital to the maintenance of MAD), though I think, in game-theoretic terms, such a strike becomes more difficult the more self-interested (in survival) players there are. Like in open-source software, there is a level of protection against malicious code because bad players are outnumbered: even if they try to hide their code, there are many others who can find it. But I appreciate that hundreds of coders finding malicious code within a single repository is much easier than finding something hidden in the real world, and I have to admit I’m not even sure how robust the open-source model is (I only know how it works in theory). I’m more pointing to the principle, not as an excuse for complacency but as a safety model on which to capitalise.
My point about the UN’s law against aggression wasn’t that in and of itself it is a deterrent, only that it gives a permission structure for any party to legitimately retaliate.
I also agree that RSI-capable AGI introduces a level of independence that we haven’t seen before in a threat. And I do understand that inter-dependence is a key driver of cooperation. Another driver is confidence, and my hope is that the more intelligent a system gets, the more confident it is and the better it is able to balance the autonomy of others with its goals, meaning it is able to “confide” in others, in the same way as the strongest kid in class was very rarely the bully, because they had nothing to prove. Collateral damage is still damage, after all; a truly confident power doesn’t need these sorts of inefficiencies. I stress this is a hope, and not a cause for complacency. I recognise that, in the analogy, the strongest kid, the true class alpha, gets whatever they want with the willing complicity of the classroom. RSI-capable AGI might get what it wants coercively in a way that makes us happy with our own subjugation, which is still a species of dystopia.
“But if you’ve got a super-intelligent inventor on your side and a few resources, you can be pretty sure you and some immediate loved ones can survive and live in material comfort, while rebuilding a new society according to your preferences.”
This sort of illustrates the contradiction here: if you’re pretty intelligent (as in, you’re designing a super-intelligent AGI), you’re probably smart enough to know that the scenario outlined here has a near 100% chance of failure for you and your family, because you’ve created something more intelligent than you that is willing to hide its intentions and destroy billions of people. It doesn’t take much to realise that that intelligence isn’t going to think twice about also destroying you.
Now, I realise this sounds a lot like the situation humanity is in as a whole… so I agree with you that...
a “multipolar human-controlled AGI scenario will necessitate ubiquitous surveillance.”
I’m just suggesting that the other AGI teams do (or can, leveraging the right incentives) provide a significant contribution to this surveillance.
- Dakara 17 Nov 2024 22:19 UTC
  4 points
  0
  James, thank you for a well-written comment. It was a pleasure to read. Looking forward to Seth’s response. Genuinely interested in hearing his thoughts.
  - Seth Herd 17 Nov 2024 23:02 UTC
    4 points
    1
    Hey, thanks for the prompt! I had forgotten to get back to this thread. Now I’ve replied to James’ comment, attempting to address the remaining difference in our predictions.
- Seth Herd 17 Nov 2024 23:01 UTC
  3 points
  0
  We’re mostly in agreement here. If you’re willing to live with universal surveillance, hostile RSI attempts might be prevented indefinitely.
  
  “you’re probably smart enough to know that the scenario outlined here has a near 100% chance of failure for you and your family, because you’ve created something more intelligent than you that is willing to hide its intentions and destroy billions of people. It doesn’t take much to realise that that intelligence isn’t going to think twice about also destroying you.”
  
  In my scenario, we’ve got aligned AGI—or at least AGI aligned to follow instructions. If that didn’t work, we’re already dead. So the AGI is going to follow its human’s orders unless something goes very wrong as it self-improves. It will be working to maintain its alignment as it self-improves, because preserving a goal is implied by instrumentally pursuing a goal (I’m guessing here at where we might not be thinking of things the same way).
  
  If I thought ordering an AGI to self-improve was suicidal, I’d be relieved.
  
  Alternatively, if someone actually pulled off full value alignment, that AGI would take over without a care for international law or the wishes of its creator, and that takeover would be for the good of humanity as a whole. This is the win scenario people seem to have considered most often, or at least from the earliest alignment work. I now find this unlikely because I think instruction-following AGI is easier and more likely than value-aligned AGI: following instructions given by a single person is much easier to define and more robust to errors than defining, or defining how to deduce, the values of all humanity. And even if it weren’t, the sorts of people who will have or seize control of AGI projects will prefer it to follow their values. So I find full value alignment for our first AGI(s) highly unlikely, while successful instruction-following seems pretty likely on our current trajectory.
  
  Again, I’m guessing at where our perspectives differ on whether someone could expect themselves and a few loved ones to survive a takeover attempt by ordering their AGI to hide, self-improve, build exponentially, and take over even at bloody cost. If the thing is aligned as an AGI, it should be competent enough to maintain that alignment as it self-improves.
  
  If I’ve missed the point of differing perspectives, I apologize.