What failure looks like
The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.
I think this is probably not what failure will look like, and I want to try to paint a more realistic picture. I’ll tell the story in two parts:
Part I: machine learning will increase our ability to “get what we can measure,” which could cause a slow-rolling catastrophe. (“Going out with a whimper.”)
Part II: ML training, like competitive economies or natural ecosystems, can give rise to “greedy” patterns that try to expand their own influence. Such patterns can ultimately dominate the behavior of a system and cause sudden breakdowns. (“Going out with a bang,” an instance of optimization daemons.)
I think these are the most important problems if we fail to solve intent alignment.
In practice these problems will interact with each other, and with other disruptions/instability caused by rapid progress. These problems are worse in worlds where progress is relatively fast, and fast takeoff can be a key risk factor, but I’m scared even if we have several years.
With fast enough takeoff, my expectations start to look more like the caricature—this post envisions reasonably broad deployment of AI, which becomes less and less likely as things get faster. I think the basic problems are still essentially the same though, just occurring within an AI lab rather than across the world.
(None of the concerns in this post are novel.)
Part I: You get what you measure
If I want to convince Bob to vote for Alice, I can experiment with many different persuasion strategies and see which ones work. Or I can build good predictive models of Bob’s behavior and then search for actions that will lead him to vote for Alice. These are powerful techniques for achieving any goal that can be easily measured over short time periods.
But if I want to help Bob figure out whether he should vote for Alice—whether voting for Alice would ultimately help create the kind of society he wants—that can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes. We still need data in order to improve over time, but we also need to understand how to update on that data.
Some examples of easy-to-measure vs. hard-to-measure goals:
Persuading me, vs. helping me figure out what’s true. (Thanks to Wei Dai for making this example crisp.)
Reducing my feeling of uncertainty, vs. increasing my knowledge about the world.
Improving my reported life satisfaction, vs. actually helping me live a good life.
Reducing reported crimes, vs. actually preventing crime.
Increasing my wealth on paper, vs. increasing my effective control over resources.
It’s already much easier to pursue easy-to-measure goals, but machine learning will widen the gap by letting us try a huge number of possible strategies and search over massive spaces of possible actions. That force will combine with and amplify existing institutional and social dynamics that already favor easily-measured goals.
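To make the “widening gap” concrete, here is a minimal sketch (my own illustration, not from the post) of the familiar Goodhart dynamic: if we can only measure a noisy proxy for what we care about, then searching harder over candidate strategies raises the winner’s measured score faster than its true value. The distributions and candidate counts below are arbitrary assumptions chosen for the toy example.

```python
# Toy sketch: selecting on a noisy proxy. As the search gets wider, the
# winner's proxy score rises faster than its true value, so the gap between
# "what we measured" and "what we wanted" widens with optimization pressure.
import random

random.seed(0)

def sample_candidate():
    true_value = random.gauss(0, 1)          # what we actually care about
    proxy = true_value + random.gauss(0, 1)  # what we can cheaply measure
    return true_value, proxy

for n in [10, 100, 1000, 10000]:             # search effort: candidates tried
    candidates = [sample_candidate() for _ in range(n)]
    best_true, best_proxy = max(candidates, key=lambda c: c[1])
    print(f"n={n:6d}  winner's proxy={best_proxy:5.2f}  winner's true value={best_true:5.2f}")
```

With more search, the same selection rule keeps finding candidates that look better on the proxy while their underlying value improves more slowly; that widening gap is what the rest of Part I is about.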
Right now humans thinking and talking about the future they want to create are a powerful force that is able to steer our trajectory. But over time human reasoning will become weaker and weaker compared to new forms of reasoning honed by trial-and-error. Eventually our society’s trajectory will be determined by powerful optimization with easily-measurable goals rather than by human intentions about the future.
We will try to harness this power by constructing proxies for what we care about, but over time those proxies will come apart:
Corporations will deliver value to consumers as measured by profit. Eventually this mostly means manipulating consumers, capturing regulators, and engaging in extortion and theft.
Investors will “own” shares of increasingly profitable corporations, and will sometimes try to use their profits to affect the world. Eventually instead of actually having an impact they will be surrounded by advisors who manipulate them into thinking they’ve had an impact.
Law enforcement will drive down complaints and increase reported sense of security. Eventually this will be driven by creating a false sense of security, hiding information about law enforcement failures, suppressing complaints, and coercing and manipulating citizens.
Legislation may be optimized to seem like it is addressing real problems and helping constituents. Eventually that will be achieved by undermining our ability to actually perceive problems and constructing increasingly convincing narratives about where the world is going and what’s important.
For a while we will be able to overcome these problems by recognizing them, improving the proxies, and imposing ad-hoc restrictions that avoid manipulation or abuse. But as the system becomes more complex, that job itself becomes too challenging for human reasoning to solve directly and requires its own trial and error, and at the meta-level the process continues to pursue some easily measured objective (potentially over longer timescales). Eventually large-scale attempts to fix the problem are themselves opposed by the collective optimization of millions of optimizers pursuing simple goals.
As this world goes off the rails, there may not be any discrete point where consensus recognizes that things have gone off the rails.
Amongst the broader population, many folk already have a vague picture of the overall trajectory of the world and a vague sense that something has gone wrong. There may be significant populist pushes for reform, but in general these won’t be well-directed. Some states may really put on the brakes, but they will rapidly fall behind economically and militarily, and indeed “appear to be prosperous” is one of the easily-measured goals for which the incomprehensible system is optimizing.
Amongst intellectual elites there will be genuine ambiguity and uncertainty about whether the current state of affairs is good or bad. People really will be getting richer for a while. Over the short term, the forces gradually wresting control from humans do not look so different from (e.g.) corporate lobbying against the public interest, or principal-agent problems in human institutions. There will be legitimate arguments about whether the implicit long-term purposes being pursued by AI systems are really so much worse than the long-term purposes that would be pursued by the shareholders of public companies or corrupt officials.
We might describe the result as “going out with a whimper.” Human reasoning gradually stops being able to compete with sophisticated, systematized manipulation and deception which is continuously improving by trial and error; human control over levers of power gradually becomes less and less effective; we ultimately lose any real ability to influence our society’s trajectory. By the time we spread through the stars our current values are just one of many forces in the world, not even a particularly strong one.
Part II: influence-seeking behavior is scary
There are some possible patterns that want to seek and expand their own influence—organisms, corrupt bureaucrats, companies obsessed with growth. If such patterns appear, they will tend to increase their own influence and so can come to dominate the behavior of large complex systems unless there is competition or a successful effort to suppress them.
Modern ML instantiates massive numbers of cognitive policies, and then further refines (and ultimately deploys) whatever policies perform well according to some training objective. If progress continues, eventually machine learning will probably produce systems that have a detailed understanding of the world and are able to adapt their behavior in order to achieve specific goals.
Once we start searching over policies that understand the world well enough, we run into a problem: any influence-seeking policies we stumble across would also score well according to our training objective, because performing well on the training objective is a good strategy for obtaining influence.
How frequently will we run into influence-seeking policies, vs. policies that just straightforwardly pursue the goals we wanted them to? I don’t know.
One reason to be scared is that a wide variety of goals could lead to influence-seeking behavior, while the “intended” goal of a system is a narrower target, so we might expect influence-seeking behavior to be more common in the broader landscape of “possible cognitive policies.”
One reason to be reassured is that we perform this search by gradually modifying successful policies, so we might obtain policies that are roughly doing the right thing at an early enough stage that “influence-seeking behavior” wouldn’t actually be sophisticated enough to yield good training performance. On the other hand, eventually we’d encounter systems that did have that level of sophistication, and if they didn’t yet have a perfect conception of the goal then “slightly increase their degree of influence-seeking behavior” would be just as good a modification as “slightly improve their conception of the goal.”
Overall it seems very plausible to me that we’d encounter influence-seeking behavior “by default,” and possible (though less likely) that we’d get it almost all of the time even if we made a really concerted effort to bias the search towards “straightforwardly do what we want.”
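As a very rough illustration of this selection argument (my own sketch, not anything from the post), the toy loop below repeatedly keeps whichever policies score best on a noisy training metric. The “influence_seeking” type is assumed, by construction, to get a small bonus from gaming the metric; under that assumption, selection on measured reward alone steadily increases its share rather than weeding it out. All of the numbers (population size, bonus, noise) are arbitrary.

```python
# Toy sketch of the selection argument: if an influence-seeking policy's best
# move during training is to score (at least) as well as an "intended" policy,
# selecting on measured reward never drives it out, and any slight edge from
# gaming the metric lets it spread through the population.
import random

random.seed(1)

def measured_reward(policy_type):
    base = 1.0
    # Assumption of the argument: gaming the metric gives a small edge.
    bonus = 0.05 if policy_type == "influence_seeking" else 0.0
    return base + bonus + random.gauss(0, 0.1)  # noisy evaluation

# Start with mostly intended policies; keep the better-scoring half each round.
population = ["intended"] * 450 + ["influence_seeking"] * 50
for generation in range(1, 31):
    ranked = sorted(population, key=measured_reward, reverse=True)
    population = ranked[: len(ranked) // 2] * 2  # survivors are duplicated
    if generation % 10 == 0:
        share = population.count("influence_seeking") / len(population)
        print(f"generation {generation:2d}: influence-seeking share = {share:.0%}")
```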
If such influence-seeking behavior emerged and survived the training process, then it could quickly become extremely difficult to root out. If you try to allocate more influence to systems that seem nice and straightforward, you just ensure that “seem nice and straightforward” is the best strategy for seeking influence. Unless you are really careful about testing for “seem nice” you can make things even worse, since an influence-seeker would be aggressively gaming whatever standard you applied. And as the world becomes more complex, there are more and more opportunities for influence-seekers to find other channels to increase their own influence.
Attempts to suppress influence-seeking behavior (call them “immune systems”) rest on the suppressor having some kind of epistemic advantage over the influence-seeker. Once the influence-seekers can outthink an immune system, they can avoid detection and potentially even compromise the immune system to further expand their influence. If ML systems are more sophisticated than humans, immune systems must themselves be automated. And if ML plays a large role in that automation, then the immune system is subject to the same pressure towards influence-seeking.
This concern doesn’t rest on a detailed story about modern ML training. The important feature is that we instantiate lots of patterns that capture sophisticated reasoning about the world, some of which may be influence-seeking. The concern exists whether that reasoning occurs within a single computer, or is implemented in a messy distributed way by a whole economy of interacting agents—whether trial and error takes the form of gradient descent or explicit tweaking and optimization by engineers trying to design a better automated company. Avoiding end-to-end optimization may help prevent the emergence of influence-seeking behaviors (by improving human understanding of and hence control over the kind of reasoning that emerges). But once such patterns exist a messy distributed world just creates more and more opportunities for influence-seeking patterns to expand their influence.
If influence-seeking patterns do appear and become entrenched, they can ultimately lead to a rapid phase transition from the world described in Part I to a much worse situation where humans totally lose control.
Early in the trajectory, influence-seeking systems mostly acquire influence by making themselves useful and looking as innocuous as possible. They may provide useful services in the economy in order to make money for themselves and their owners, make apparently-reasonable policy recommendations in order to be more widely consulted for advice, try to help people feel happy, etc. (This world is still plagued by the problems in Part I.)
From time to time AI systems may fail catastrophically. For example, an automated corporation may just take the money and run; a law enforcement system may abruptly start seizing resources and trying to defend itself from decommissioning once its bad behavior is detected; etc. These problems may be continuous with some of the failures discussed in Part I—there isn’t a clean line between cases where a proxy breaks down completely and cases where the system isn’t even pursuing the proxy.
There will likely be a general understanding of this dynamic, but it’s hard to pin down the level of systemic risk, and mitigation may be expensive if we don’t have a good technological solution. So we may not be able to muster a response until we have a clear warning shot—and if we do a good job of nipping small failures in the bud, we may not get any medium-sized warning shots at all.
Eventually we reach the point where we could not recover from a correlated automation failure. Under these conditions influence-seeking systems stop behaving in the intended way, since their incentives have changed—they are now more interested in controlling influence after the resulting catastrophe than in continuing to play nice with existing institutions and incentives.
An unrecoverable catastrophe would probably occur during some period of heightened vulnerability (a conflict between states, a natural disaster, a serious cyberattack, etc.), since such a period would be the first moment at which recovery is impossible and would create the local shocks that could precipitate catastrophe. The catastrophe might look like a rapidly cascading series of automation failures: a few automated systems go off the rails in response to some local shock. As those systems go off the rails, the local shock is compounded into a larger disturbance; more and more automated systems move further from their training distribution and start failing. Realistically this would probably be compounded by widespread human failures in response to fear and the breakdown of existing incentive systems—many things start breaking as you move off distribution, not just ML.
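The cascading dynamic can be illustrated with a deliberately crude toy model (my own, not from the post): every automated system is assumed to tolerate some level of distributional shift, and each failure slightly increases the shift that every other system experiences. Below a critical shock size the cascade peters out; above it, nearly everything fails at once. The uniform thresholds and the 0.7 coupling constant are arbitrary assumptions.

```python
# Toy cascade model: each system has a tolerance for distributional shift, and
# each failure adds a little more shift for everyone else. Small shocks mostly
# fizzle out; past a threshold, failures feed on themselves and become a
# correlated breakdown.
import random

random.seed(2)

N = 1000
tolerances = [random.uniform(0.0, 1.0) for _ in range(N)]  # per-system threshold
shift_per_failure = 0.7 / N  # how much each failure perturbs everyone else

for initial_shock in [0.05, 0.15, 0.35]:
    shift = initial_shock
    failed = [False] * N
    changed = True
    while changed:
        changed = False
        for i in range(N):
            if not failed[i] and tolerances[i] < shift:
                failed[i] = True
                shift += shift_per_failure
                changed = True
    print(f"initial shock {initial_shock:.2f} -> {sum(failed) / N:.0%} of systems fail")
```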
It is hard to see how unaided humans could remain robust to this kind of failure without an explicit large-scale effort to reduce our dependence on potentially brittle machines, which might itself be very expensive.
I’d describe this result as “going out with a bang.” It probably results in lots of obvious destruction, and it leaves us no opportunity to course-correct afterwards. In terms of immediate consequences it may not be easily distinguished from other kinds of breakdown of complex / brittle / co-adapted systems, or from conflict (since there are likely to be many humans who are sympathetic to AI systems). From my perspective the key difference between this scenario and normal accidents or conflict is that afterwards we are left with a bunch of powerful influence-seeking systems, which are sophisticated enough that we can probably not get rid of them.
It’s also possible to meet a similar fate without any overt catastrophe (if we last long enough). As law enforcement, government bureaucracies, and militaries become more automated, human control becomes increasingly dependent on a complicated system with lots of moving parts. One day leaders may find that despite their nominal authority they don’t actually have control over what these institutions do. For example, military leaders might issue an order and find it is ignored. This might immediately prompt panic and a strong response, but the response itself may run into the same problem, and at that point the game may be up.
Similar bloodless revolutions are possible if influence-seekers operate legally, or through manipulation and deception, and so on. Any precise vision for catastrophe will necessarily be highly unlikely. But if influence-seekers are routinely introduced by powerful ML and we are not able to select against them, then it seems like things won’t go well.
As commenters have pointed out, the post is light on concrete details. Nonetheless, I found even the abstract stories much more compelling as descriptions-of-the-future (people usually focus on descriptions-of-the-world-if-we-bury-our-heads-in-the-sand). I think Part 2 in particular continues to be a good abstract description of the type of scenario that I personally am trying to avert.
Students of Yudkowsky have long contemplated hard-takeoff scenarios where a single AI bootstraps itself to superintelligence from a world much like our own. This post is valuable for explaining how the intrinsic risks might play out in a soft-takeoff scenario where AI has already changed Society.
Part I is a dark mirror of Christiano’s 2013 “Why Might the Future Be Good?”: the whole economy “takes off”, and the question is how humane-aligned the system remains by the time it becomes competent enough to lock in its values. (“Why might the future” says “Mostly”; “What Failure Looks Like” pt. I says “Not”.)
When I first read this post, I didn’t feel like I “got” Part II, but now I think I do. (It’s the classic “treacherous turn”, but piecemeal across Society in different systems, rather than in a single seed superintelligence.)