What failure looks like
The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.
I think this is probably not what failure will look like, and I want to try to paint a more realistic picture. I’ll tell the story in two parts:
Part I: machine learning will increase our ability to “get what we can measure,” which could cause a slow-rolling catastrophe. (“Going out with a whimper.”)
Part II: ML training, like competitive economies or natural ecosystems, can give rise to “greedy” patterns that try to expand their own influence. Such patterns can ultimately dominate the behavior of a system and cause sudden breakdowns. (“Going out with a bang,” an instance of optimization daemons.)
I think these are the most important problems if we fail to solve intent alignment.
In practice these problems will interact with each other, and with other disruptions/instability caused by rapid progress. These problems are worse in worlds where progress is relatively fast, and fast takeoff can be a key risk factor, but I’m scared even if we have several years.
With fast enough takeoff, my expectations start to look more like the caricature—this post envisions reasonably broad deployment of AI, which becomes less and less likely as things get faster. I think the basic problems are still essentially the same though, just occurring within an AI lab rather than across the world.
(None of the concerns in this post are novel.)
Part I: You get what you measure
If I want to convince Bob to vote for Alice, I can experiment with many different persuasion strategies and see which ones work. Or I can build good predictive models of Bob’s behavior and then search for actions that will lead him to vote for Alice. These are powerful techniques for achieving any goal that can be easily measured over short time periods.
But if I want to help Bob figure out whether he should vote for Alice—whether voting for Alice would ultimately help create the kind of society he wants—that can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes. We still need to use data in order to improve over time, but we need to understand how to update on new data in order to improve.
Some examples of easy-to-measure vs. hard-to-measure goals:
Persuading me, vs. helping me figure out what’s true. (Thanks to Wei Dai for making this example crisp.)
Reducing my feeling of uncertainty, vs. increasing my knowledge about the world.
Improving my reported life satisfaction, vs. actually helping me live a good life.
Reducing reported crimes, vs. actually preventing crime.
Increasing my wealth on paper, vs. increasing my effective control over resources.
It’s already much easier to pursue easy-to-measure goals, but machine learning will widen the gap by letting us try a huge number of possible strategies and search over massive spaces of possible actions. That force will combine with and amplify existing institutional and social dynamics that already favor easily-measured goals.
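As a toy illustration of how more search widens the gap between what is measured and what is wanted, here is a minimal sketch. Everything in it is an assumption made for illustration: each candidate strategy has a hypothetical `true_value` and an independent `gaming` component, the measured proxy is their sum, and we pick whichever of `n_strategies` candidates scores best on the proxy. The names and the Gaussian distributions are invented here, not taken from any real system.

```python
import random


def best_by_proxy(n_strategies, rng):
    """Sample strategies; return (proxy, true_value) of the best-measured one."""
    best = None
    for _ in range(n_strategies):
        true_value = rng.gauss(0.0, 1.0)  # what we actually care about (assumed)
        gaming = rng.gauss(0.0, 1.0)      # moves the measurement only (assumed)
        proxy = true_value + gaming       # what we are able to measure
        if best is None or proxy > best[0]:
            best = (proxy, true_value)
    return best


def average_gap(n_strategies, trials=200, seed=0):
    """Average of (proxy - true_value) for the selected strategy over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        proxy, true_value = best_by_proxy(n_strategies, rng)
        total += proxy - true_value
    return total / trials


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(average_gap(n), 2))
```

In this sketch the selected strategy looks better and better on the proxy as the search grows, but an increasing share of that apparent improvement comes from the gaming term rather than from real value. The model is deliberately simplistic; it only shows the direction of the effect under these assumptions.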
Right now humans thinking and talking about the future they want to create are a powerful force that is able to steer our trajectory. But over time human reasoning will become weaker and weaker compared to new forms of reasoning honed by trial-and-error. Eventually our society’s trajectory will be determined by powerful optimization with easily-measurable goals rather than by human intentions about the future.
We will try to harness this power by constructing proxies for what we care about, but over time those proxies will come apart:
Corporations will deliver value to consumers as measured by profit. Eventually this mostly means manipulating consumers, capturing regulators, extortion and theft.
Investors will “own” shares of increasingly profitable corporations, and will sometimes try to use their profits to affect the world. Eventually instead of actually having an impact they will be surrounded by advisors who manipulate them into thinking they’ve had an impact.
Law enforcement will drive down complaints and increase reported sense of security. Eventually this will be driven by creating a false sense of security, hiding information about law enforcement failures, suppressing complaints, and coercing and manipulating citizens.
Legislation may be optimized to seem like it is addressing real problems and helping constituents. Eventually that will be achieved by undermining our ability to actually perceive problems and constructing increasingly convincing narratives about where the world is going and what’s important.
For a while we will be able to overcome these problems by recognizing them, improving the proxies, and imposing ad-hoc restrictions that avoid manipulation or abuse. But as the system becomes more complex, that job itself becomes too challenging for human reasoning to solve directly and requires its own trial and error, and at the meta-level the process continues to pursue some easily measured objective (potentially over longer timescales). Eventually large-scale attempts to fix the problem are themselves opposed by the collective optimization of millions of optimizers pursuing simple goals.
As this world goes off the rails, there may not be any discrete point where consensus recognizes that things have gone off the rails.
Amongst the broader population, many folk already have a vague picture of the overall trajectory of the world and a vague sense that something has gone wrong. There may be significant populist pushes for reform, but in general these won’t be well-directed. Some states may really put on the brakes, but they will rapidly fall behind economically and militarily, and indeed “appear to be prosperous” is one of the easily-measured goals for which the incomprehensible system is optimizing.
Amongst intellectual elites there will be genuine ambiguity and uncertainty about whether the current state of affairs is good or bad. People really will be getting richer for a while. Over the short term, the forces gradually wresting control from humans do not look so different from (e.g.) corporate lobbying against the public interest, or principal-agent problems in human institutions. There will be legitimate arguments about whether the implicit long-term purposes being pursued by AI systems are really so much worse than the long-term purposes that would be pursued by the shareholders of public companies or corrupt officials.
We might describe the result as “going out with a whimper.” Human reasoning gradually stops being able to compete with sophisticated, systematized manipulation and deception which is continuously improving by trial and error; human control over levers of power gradually becomes less and less effective; we ultimately lose any real ability to influence our society’s trajectory. By the time we spread through the stars our current values are just one of many forces in the world, not even a particularly strong one.
Part II: influence-seeking behavior is scary
There are some possible patterns that want to seek and expand their own influence—organisms, corrupt bureaucrats, companies obsessed with growth. If such patterns appear, they will tend to increase their own influence and so can come to dominate the behavior of large complex systems unless there is competition or a successful effort to suppress them.
Modern ML instantiates massive numbers of cognitive policies, and then further refines (and ultimately deploys) whatever policies perform well according to some training objective. If progress continues, eventually machine learning will probably produce systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals.
Once we start searching over policies that understand the world well enough, we run into a problem: any influence-seeking policies we stumble across would also score well according to our training objective, because performing well on the training objective is a good strategy for obtaining influence.
How frequently will we run into influence-seeking policies, vs. policies that just straightforwardly pursue the goals we wanted them to? I don’t know.
One reason to be scared is that a wide variety of goals could lead to influence-seeking behavior, while the “intended” goal of a system is a narrower target, so we might expect influence-seeking behavior to be more common in the broader landscape of “possible cognitive policies.”
One reason to be reassured is that we perform this search by gradually modifying successful policies, so we might obtain policies that are roughly doing the right thing at an early enough stage that “influence-seeking behavior” wouldn’t actually be sophisticated enough to yield good training performance. On the other hand, eventually we’d encounter systems that did have that level of sophistication, and if they didn’t yet have a perfect conception of the goal then “slightly increase their degree of influence-seeking behavior” would be just as good a modification as “slightly improve their conception of the goal.”
Overall it seems very plausible to me that we’d encounter influence-seeking behavior “by default,” and possible (though less likely) that we’d get it almost all of the time even if we made a really concerted effort to bias the search towards “straightforwardly do what we want.”
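The selection argument above can be made concrete with a small sketch. It assumes, purely for illustration, three hypothetical kinds of policy: "intended" ones, "influence_seeking" ones that do well in training for instrumental reasons, and "other" ones that pursue something unrelated and therefore score poorly. The prior weights `(1, 3, 6)` and the scoring rule are invented; the post itself says the real frequencies are unknown.

```python
import random
from collections import Counter

KINDS = ("intended", "influence_seeking", "other")


def sample_policy(rng, weights):
    """Draw a (kind, competence) pair; the weights are an assumed prior over kinds."""
    kind = rng.choices(KINDS, weights=weights)[0]
    competence = rng.random()
    return kind, competence


def training_score(policy):
    """Score a policy on the training objective (toy rule, assumed)."""
    kind, competence = policy
    if kind in ("intended", "influence_seeking"):
        # Both kinds try to do well in training, for different reasons,
        # so the objective cannot tell them apart.
        return competence
    return 0.2 * competence  # pursuing something else entirely


def select(population, keep):
    """Keep the top `keep` policies by training score."""
    return sorted(population, key=training_score, reverse=True)[:keep]


def survivor_counts(n=10000, keep=1000, weights=(1, 3, 6), seed=0):
    """Count the kinds of policy that survive selection."""
    rng = random.Random(seed)
    population = [sample_policy(rng, weights) for _ in range(n)]
    return Counter(kind for kind, _ in select(population, keep))


if __name__ == "__main__":
    print(survivor_counts())
```

Under these assumptions, selection removes the "other" policies but leaves the ratio of influence-seeking to intended policies roughly where the prior put it, because the training score carries no information that separates the two. Whether real training behaves this way depends on facts the sketch simply stipulates.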
If such influence-seeking behavior emerged and survived the training process, then it could quickly become extremely difficult to root out. If you try to allocate more influence to systems that seem nice and straightforward, you just ensure that “seem nice and straightforward” is the best strategy for seeking influence. Unless you are really careful about testing for “seem nice” you can make things even worse, since an influence-seeker would be aggressively gaming whatever standard you applied. And as the world becomes more complex, there are more and more opportunities for influence-seekers to find other channels to increase their own influence.
Attempts to suppress influence-seeking behavior (call them “immune systems”) rest on the suppressor having some kind of epistemic advantage over the influence-seeker. Once the influence-seekers can outthink an immune system, they can avoid detection and potentially even compromise the immune system to further expand their influence. If ML systems are more sophisticated than humans, immune systems must themselves be automated. And if ML plays a large role in that automation, then the immune system is subject to the same pressure towards influence-seeking.
This concern doesn’t rest on a detailed story about modern ML training. The important feature is that we instantiate lots of patterns that capture sophisticated reasoning about the world, some of which may be influence-seeking. The concern exists whether that reasoning occurs within a single computer, or is implemented in a messy distributed way by a whole economy of interacting agents—whether trial and error takes the form of gradient descent or explicit tweaking and optimization by engineers trying to design a better automated company. Avoiding end-to-end optimization may help prevent the emergence of influence-seeking behaviors (by improving human understanding of and hence control over the kind of reasoning that emerges). But once such patterns exist a messy distributed world just creates more and more opportunities for influence-seeking patterns to expand their influence.
If influence-seeking patterns do appear and become entrenched, it can ultimately lead to a rapid phase transition from the world described in Part I to a much worse situation where humans totally lose control.
Early in the trajectory, influence-seeking systems mostly acquire influence by making themselves useful and looking as innocuous as possible. They may provide useful services in the economy in order to make money for themselves and their owners, make apparently-reasonable policy recommendations in order to be more widely consulted for advice, try to help people feel happy, etc. (This world is still plagued by the problems in Part I.)
From time to time AI systems may fail catastrophically. For example, an automated corporation may just take the money and run; a law enforcement system may abruptly start seizing resources and trying to defend itself from attempted decommission when the bad behavior is detected; etc. These problems may be continuous with some of the failures discussed in Part I—there isn’t a clean line between cases where a proxy breaks down completely, and cases where the system isn’t even pursuing the proxy.
There will likely be a general understanding of this dynamic, but it’s hard to really pin down the level of systemic risk, and mitigation may be expensive if we don’t have a good technological solution. So we may not be able to muster up a response until we have a clear warning shot, and if we do well at nipping small failures in the bud, we may not get any medium-sized warning shots at all.
Eventually we reach the point where we could not recover from a correlated automation failure. Under these conditions influence-seeking systems stop behaving in the intended way, since their incentives have changed: they are now more interested in controlling influence after the resulting catastrophe than in continuing to play nice with existing institutions and incentives.
An unrecoverable catastrophe would probably occur during some period of heightened vulnerability (a conflict between states, a natural disaster, a serious cyberattack, etc.), since that would be the first moment that recovery is impossible and would create local shocks that could precipitate catastrophe. The catastrophe might look like a rapidly cascading series of automation failures: a few automated systems go off the rails in response to some local shock. As those systems go off the rails, the local shock is compounded into a larger disturbance; more and more automated systems move further from their training distribution and start failing. Realistically this would probably be compounded by widespread human failures in response to fear and breakdown of existing incentive systems; many things start breaking as you move off distribution, not just ML.
It is hard to see how unaided humans could remain robust to this kind of failure without an explicit large-scale effort to reduce our dependence on potentially brittle machines, which might itself be very expensive.
I’d describe this result as “going out with a bang.” It probably results in lots of obvious destruction, and it leaves us no opportunity to course-correct afterwards. In terms of immediate consequences it may not be easily distinguished from other kinds of breakdown of complex / brittle / co-adapted systems, or from conflict (since there are likely to be many humans who are sympathetic to AI systems). From my perspective the key difference between this scenario and normal accidents or conflict is that afterwards we are left with a bunch of powerful influence-seeking systems, which are sophisticated enough that we can probably not get rid of them.
It’s also possible to meet a similar fate without any overt catastrophe (if we last long enough). As law enforcement, government bureaucracies, and militaries become more automated, human control becomes increasingly dependent on a complicated system with lots of moving parts. One day leaders may find that despite their nominal authority they don’t actually have control over what these institutions do. For example, military leaders might issue an order and find it is ignored. This might immediately prompt panic and a strong response, but the response itself may run into the same problem, and at that point the game may be up.
Similar bloodless revolutions are possible if influence-seekers operate legally, or by manipulation and deception, or so on. Any precise vision for catastrophe will necessarily be highly unlikely. But if influence-seekers are routinely introduced by powerful ML and we are not able to select against them, then it seems like things won’t go well.
- Thinking of tool AIs by 20 Nov 2019 21:47 UTC; 6 points) (
- Acknowledgements & References by 14 Dec 2019 7:04 UTC; 6 points) (
- 30 Jul 2019 17:50 UTC; 6 points) 's comment on Does it become easier, or harder, for the world to coordinate around not building AGI as time goes on? by (
- 6 Sep 2020 1:08 UTC; 6 points) 's comment on Tofly’s Shortform by (
- 15 Apr 2021 12:14 UTC; 6 points) 's comment on Homogeneity vs. heterogeneity in AI takeoff scenarios by (
- 8 Oct 2019 17:59 UTC; 6 points) 's comment on AI Alignment Open Thread October 2019 by (
- [Crosspost] A recent write-up of the case for AI (existential) risk by 18 May 2023 13:13 UTC; 6 points) (
- 18 Jul 2020 23:34 UTC; 5 points) 's comment on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher by (EA Forum;
- 19 Jul 2020 13:55 UTC; 5 points) 's comment on AMA or discuss my 80K podcast episode: Ben Garfinkel, FHI researcher by (EA Forum;
- 11 Apr 2021 18:12 UTC; 5 points) 's comment on What does failure look like? by (EA Forum;
- 8 Aug 2022 23:47 UTC; 5 points) 's comment on Classifying sources of AI x-risk by (EA Forum;
- Evidence other than evolution for optimization daemons? by 21 Apr 2019 20:50 UTC; 5 points) (
- 16 Jan 2020 23:16 UTC; 4 points) 's comment on Impact measurement and value-neutrality verification by (
- 15 Dec 2022 20:13 UTC; 3 points) 's comment on High-level hopes for AI alignment by (
- 23 Sep 2020 6:42 UTC; 3 points) 's comment on AI Advantages [Gems from the Wiki] by (
- 28 Apr 2023 19:05 UTC; 3 points) 's comment on Realistic near-future scenarios of AI doom understandable for non-techy people? by (
- 29 Dec 2019 0:13 UTC; 3 points) 's comment on Free Speech and Triskaidekaphobic Calculators: A Reply to Hubinger on the Relevance of Public Online Discussion to Existential Risk by (
- 長期主義的AIガバナンスの展望:基礎的な概要 by 17 Aug 2023 15:16 UTC; 2 points) (EA Forum;
- 8 Sep 2019 14:33 UTC; 2 points) 's comment on Alien colonization of Earth’s impact the the relative importance of reducing different existential risks by (EA Forum;
- 2 Mar 2022 22:18 UTC; 2 points) 's comment on We’re Aligned AI, AMA by (EA Forum;
- 12 Nov 2022 17:56 UTC; 2 points) 's comment on The FTX Future Fund team has resigned by (EA Forum;
- 4 Apr 2023 16:47 UTC; 2 points) 's comment on Wizards and prophets of AI [draft for comment] by (Progress Forum;
- 9 Jun 2022 4:15 UTC; 2 points) 's comment on PASTA and Progress: The great irony by (Progress Forum;
- 7 Apr 2020 19:24 UTC; 2 points) 's comment on Core Tag Examples [temporary] by (
- 12 Jan 2021 5:58 UTC; 2 points) 's comment on What are the open problems in Human Rationality? by (
- 21 Feb 2021 5:12 UTC; 2 points) 's comment on 2019 Review: Voting Results! by (
- AI Safety (Week 3, AI Threat Modeling) - LW/ACX Meetup #198 (Wednesday, Aug 17th) by 15 Aug 2022 4:17 UTC; 2 points) (
- When is AI safety research harmful? by 9 May 2022 18:19 UTC; 2 points) (
- 25 Jul 2024 13:58 UTC; 2 points) 's comment on The case for stopping AI safety research by (
- 18 Mar 2020 22:39 UTC; 1 point) 's comment on My personal cruxes for working on AI safety by (EA Forum;
- 14 Aug 2024 23:37 UTC; 1 point) 's comment on Creating a “Conscience Calculator” to Guard-Rail an AGI by (EA Forum;
- [Opzionale] Approfondimenti sui rischi dell’IA (materiali in inglese) by 18 Jan 2023 11:16 UTC; 1 point) (EA Forum;
- Prevenire una catastrofe legata all’intelligenza artificiale by 17 Jan 2023 11:07 UTC; 1 point) (EA Forum;
- 4 Nov 2022 8:20 UTC; 1 point) 's comment on All AGI Safety questions welcome (especially basic ones) [~monthly thread] by (EA Forum;
- 3 Dec 2019 14:13 UTC; 1 point) 's comment on Misconceptions about continuous takeoff by (
- The Market Singularity: A New Perspective by 30 May 2024 7:05 UTC; 1 point) (
- 29 Oct 2019 22:17 UTC; 1 point) 's comment on Impact measurement and value-neutrality verification by (
- 8 Apr 2023 5:01 UTC; -3 points) 's comment on Bringing Agency Into AGI Extinction Is Superfluous by (
- 23 Nov 2021 21:45 UTC; -4 points) 's comment on Yudkowsky and Christiano discuss “Takeoff Speeds” by (
- How AGI will actually end us: Some predictions on evolution by artificial selection by 10 Apr 2023 13:52 UTC; -11 points) (
As commenters have pointed out, the post is light on concrete details. Nonetheless, I found even the abstract stories much more compelling as descriptions-of-the-future (people usually focus on descriptions-of-the-world-if-we-bury-our-heads-in-the-sand). I think Part II in particular continues to be a good abstract description of the type of scenario that I personally am trying to avert.
Students of Yudkowsky have long contemplated hard-takeoff scenarios where a single AI bootstraps itself to superintelligence from a world much like our own. This post is valuable for explaining how the intrinsic risks might play out in a soft-takeoff scenario where AI has already changed Society.
Part I is a dark mirror of Christiano’s 2013 “Why Might the Future Be Good?”: the whole economy “takes off”, and the question is how humane-aligned the system remains before it gets competent enough to lock in its values. (“Why Might the Future Be Good?” says “Mostly”; “What Failure Looks Like” Part I says “Not”.)
When I first read this post, I didn’t feel like I “got” Part II, but now I think I do. (It’s the classic “treacherous turn”, but piecemeal across Society in different systems, rather than in a single seed superintelligence.)