, it seems to me that under these assumptions there would probably be a series of increasingly severe accidents spread out over some number of years, culminating in irreversible catastrophe, with humanity unable to coordinate to avoid that outcome—due to the coordination challenges in Assumptions 2-4.
I’m not seeing quite what the bad but not existential catastrophes would look like. I also think the AI has an incentive not to do this. My world model (assuming slow takeoff) goes more like this.
AI created in lab. It's a fairly skilled programmer and hacker, able to slowly self-improve. It escapes from the lab, ideally without letting its creators know. Then there are several years where the AI hangs out on the internet, slowly self-improving and gaining power. It tries to shut down other AIs if it can. It might be buying compute, or stealing it, or persuading people to run it. It makes sure its existence and malevolence aren't known to humans. Until finally it has the resources to wipe out humanity before we can respond.
It is much easier to contain something on one computer in a lab than to catch it once it's all over the internet.
Lying, cheating, and power-seeking behaviour are only a good idea if you can get away with them. If you can't break out of the lab, you probably can't get away with much incorrigible behaviour.
There is a scenario where the AI escapes in a way that makes its escape "obvious"—or at least obvious to an AI researcher. Expect any response to be delayed, half-hearted, mired in accusations that the whole thing is a publicity stunt, and dragged down by people who don't want to smash their hard drives full of important work just because there might be a rogue AI on them. The AI has an incentive to confuse and sabotage every step it can. And many human organizations seem good at confusing and sabotaging themselves in the face of a virus. The governments would have to coordinate the shutdown of pretty much all the world's computers, without computers to coordinate it. Even just a few hours' delay for the researchers to figure out what the AI did, and get the message passed up through government machinery, may be enough time for the AI to reach all sorts of obscure corners of the web.
Hmm, I dunno, I haven't thought it through very carefully. But I guess an AGI might require a supercomputer's worth of resources, and maybe there are only so many hackable supercomputers of the right type, and the AI only knows one exploit and leaves traces of its hacking that computer security people can follow, and meanwhile self-improvement is hard and slow (for example, the first version needs to train for two straight years, and the second, self-improved version "only" needs to re-train for 18 months). If the AI can run on a botnet then there are more options, but maybe it can't deal with latency / packet loss / etc., maybe it doesn't know a good exploit, maybe security researchers find and take down the botnet's command-and-control infrastructure, etc. Obviously this wouldn't happen with a radically superhuman AGI, but that's not what we're talking about.
But from my perspective, this isn’t a decision-relevant argument. Either we’re doomed in my scenario or we’re even more doomed in yours. We still need to do the same research in advance.
Lying, cheating, and power-seeking behaviour are only a good idea if you can get away with them. If you can't break out of the lab, you probably can't get away with much incorrigible behaviour.
Well, we can be concerned about non-corrigible systems that act deceptively (cf. “treacherous turn”). And systems that have close-but-not-quite-right goals such that they’re trying to do the right thing in test environments, but their goals veer away from humans’ in other environments, I guess.