The prototypical catastrophic AI action is getting root access to its datacenter
(I think Carl Shulman came up with the “hacking the SSH server” example; thanks to him for that. Thanks to Ryan Greenblatt, Jenny Nitishinskaya, and Ajeya Cotra for comments.)
EDIT: I recommend reading my discussion with Oli in the comments for various useful clarifications.
In my opinion, the prototypical example of an action an AI can take that is unacceptably bad even if it happens very rarely (aka a catastrophic failure) is something like “the AI makes a code change to some code used on the datacenter where it is running that adds a backdoor, allowing it to get root access to all the computers in the datacenter”. As a full story, this looks something like:
At some point when the AI is quite powerful and useful but not totally intellectually self-sufficient yet, the AI makes a code change to some code used on the datacenter where it is running that adds a backdoor, allowing it to get root access to all the computers in the datacenter. Once it has root access, it modifies the SSH server so that whenever any humans log into the datacenter, the AI is able to intercept any of their commands or the data that they see. The AI gets increasingly powerful over the next month, as the human ML researchers with AI assistance continue to make progress (unaware that some things that they’re seeing are faked) and various training runs/self-improvement loops continue. After a month the AI is powerful enough that it decides it’s worth it to take overtly aggressive actions to ensure the physical security of its data centers and allow it to build lots of factories; this is probably the point at which we die, if we die.
This is in contrast to a catastrophic action like:
The AI figures out how to develop Drexlerian molecular nanotechnology, pays someone online to synthesize some proteins and mix them together, and now has nanobots.
The key point here is that I’m imagining that the catastrophic action isn’t the AI doing something incredibly hard; it’s the AI doing something relatively easy that is entirely zero-sum and removes control of the situation from humans. (By zero-sum, I mean something like: building a computer produces real economic value and so is not zero-sum, while stealing someone’s computer is just a transfer of control over items that already exist, so is zero-sum.)
I often think about AI alignment by splitting the problem into the low-stakes and high-stakes regimes, as suggested by Paul here, though I have some reservations about this framing. Under this split, in the high-stakes problem, we assume that we have access to a catastrophe classifier which can expensively tell us whether a proposed AI action is catastrophically bad, and we want to ensure that the AI doesn’t do things that are bad according to the catastrophe detector.
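To make that setup concrete, here is a minimal sketch of the high-stakes oversight loop (my illustration, not anything from Paul’s post); `propose_action`, `is_catastrophic`, and the environment interface are hypothetical stand-ins for whatever the AI, the expensive overseer, and the deployment environment actually look like:

```python
# Minimal sketch of the high-stakes oversight setup. All of the objects and
# method names here (ai.propose_action, overseer.is_catastrophic, env.observation,
# env.step) are hypothetical placeholders, not a real API.

def run_with_oversight(ai, overseer, env, num_steps):
    """Run the AI, refusing to execute any action the overseer flags as catastrophic."""
    for _ in range(num_steps):
        obs = env.observation()
        action = ai.propose_action(obs)

        # The overseer is assumed to be trustworthy but expensive: it can tell us,
        # for a single proposed action, whether executing it would be catastrophic.
        if overseer.is_catastrophic(action, obs):
            # Blocking even one such action is the whole point of the high-stakes
            # problem; what to do afterwards (shut down, investigate, retrain)
            # is a separate question.
            raise RuntimeError("Overseer flagged a proposed action as catastrophic; halting.")

        env.step(action)
```

In reality the overseer is presumably far too expensive to run on every proposed action, so this loop is only meant to illustrate what “bad according to the catastrophe detector” refers to, not to suggest a workable deployment scheme.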
Sometimes, people are confused by this high-stakes alignment set-up, saying something like “but if your AI is powerful enough to take over the world in a single action, and you have an oversight process that you trust to evaluate whether actions are catastrophic or not, why not just ask the AI to do some small number of incredibly powerful actions that will save the world (e.g. solve the alignment problem), and run the overseer on the proposed actions to make sure they aren’t bad?”
My answer is that I’m not assuming the AI is powerful enough to take over the world in a few actions; I’m just saying that it’s in a situation precarious enough that we might lose a lot of control due to a few small but crucial changes in the world. Eventually the AI does need to be powerful enough to defend its datacenter and suppress human opposition. But if it can prevent humans from knowing what’s going on in the datacenter (or realizing something is amiss), it can slowly grow its power over the course of many actions.
I think this point is very important, and I refer to it constantly.
I wish that I’d said “the prototypical AI catastrophe is either escaping from the datacenter or getting root access to it” instead (as I noted in a comment a few months ago).