AGI safety from first principles: Introduction
This is the first part of a six-part report called AGI safety from first principles, in which I’ve attempted to put together the most complete and compelling case I can for why the development of AGI might pose an existential threat. The report stems from my dissatisfaction with existing arguments about the potential risks from AGI. Early work tends to be less relevant in the context of modern machine learning; more recent work is scattered and brief. I originally intended to just summarise other people’s arguments, but as this report has grown, it’s become more representative of my own views and less representative of anyone else’s. So while it covers the standard ideas, I also think that it provides a new perspective on how to think about AGI—one which doesn’t take any previous claims for granted, but attempts to work them out from first principles.
Having said that, the breadth of the topic I’m attempting to cover means that I’ve included many arguments which are only hastily sketched out, and undoubtedly a number of mistakes. I hope to continue polishing this report, and I welcome feedback and help in doing so. I’m also grateful to many people who have given feedback and encouragement so far. I plan to cross-post some of the most useful comments I’ve received to the Alignment Forum once I’ve had a chance to ask permission. I’ve posted the report itself in six sections; the first and last are shorter framing sections, while the middle four correspond to the four premises of the argument laid out below.
AGI safety from first principles
The key concern motivating technical AGI safety research is that we might build autonomous artificially intelligent agents which are much more intelligent than humans, and which pursue goals that conflict with our own. Human intelligence allows us to coordinate complex societies and deploy advanced technology, and thereby control the world to a greater extent than any other species. But AIs will eventually become more capable than us at the types of tasks by which we maintain and exert that control. If they don’t want to obey us, then humanity might become only Earth’s second most powerful “species”, and lose the ability to create a valuable and worthwhile future.
I’ll call this the “second species” argument; I think it’s a plausible argument which we should take very seriously.[1] However, the version stated above relies on several vague concepts and intuitions. In this report I’ll give the most detailed presentation of the second species argument that I can, highlighting the aspects that I’m still confused about. In particular, I’ll defend a version of the second species argument which claims that, without a concerted effort to prevent it, there’s a significant chance that:
1. We’ll build AIs which are much more intelligent than humans (i.e. superintelligent).
2. Those AIs will be autonomous agents which pursue large-scale goals.
3. Those goals will be misaligned with ours; that is, they will aim towards outcomes that aren’t desirable by our standards, and trade off against our goals.
4. The development of such AIs would lead to them gaining control of humanity’s future.
While I use many examples from modern deep learning, this report is also intended to apply to AIs developed using very different models, training algorithms, optimisers, or training regimes than the ones we use today. However, many of my arguments would no longer be relevant if the field of AI moves away from focusing on machine learning. I also frequently compare AI development to the evolution of human intelligence; while the two aren’t fully analogous, humans are the best example we currently have to ground our thinking about generally intelligent AIs.
[1] Stuart Russell also refers to this as the “gorilla problem” in his recent book, Human Compatible.
I haven’t had time to reread this sequence in depth, but I wanted to at least touch on how I’d evaluate it. It seems to be aiming to be both a good introductory sequence and “the most complete and compelling case I can for why the development of AGI might pose an existential threat”.
The questions are who this sequence is for, what its goal is, and how it compares to other writing targeting similar demographics.
Some writing that comes to mind to compare/contrast it with includes:
- Scott Alexander’s Superintelligence FAQ. This is the post I’ve found most helpful for convincing people (including myself) that yes, AI actually is a big deal and an extinction risk. It’s 8000 words. It’s written fairly entertainingly. What I find particularly compelling here are a bunch of factual statements about recent AI advances that I hadn’t known about at the time.
- Tim Urban’s Road To Superintelligence series. This is even more optimized for entertainingness. I recall it being a bit more handwavy and making some claims that were either objectionable, or at least felt more objectionable. It’s 22,000 words.
- Alex Flint’s AI Risk for Epistemic Minimalists. This goes in a pretty different direction: not entertaining, and not really comprehensive either. It came to mind because it’s doing a sort-of-similar thing of “remove as many prerequisites or assumptions as possible”. (I’m not actually sure it’s that helpful: the specific assumptions it avoids making don’t feel like issues I expect to come up for most people, and it doesn’t make a very strong claim about what to do.)
(I recall Scott Alexander once trying to run a pseudo-study where he had people read a randomized intro post on AI alignment, I think including his own Superintelligence FAQ and Tim Urban’s posts among others, to see how it changed people’s minds. I vaguely recall it didn’t find that big a difference between them. I’d be curious how this sequence would compare.)
At a glance, AGI Safety From First Principles seems to be more complete than Alex Flint’s piece, and more serious and a bit more academic than Scott’s or Tim’s writing. I assume it’s aimed at a somewhat skeptical researcher, and is meant not only to convince them the problem exists, but also to give them some technical hooks for how to start thinking about it. I’m curious how well it actually succeeds at that.