Resources for AI Alignment Cartography
I want to make an actionable map of AI alignment.
After years of reading papers, blog posts, online exchanges, books, and occasionally hidden documents about AI alignment and AI risk, and having extremely interesting conversations about it, most arguments I encounter now feel familiar at best, rehashed at worst. This should mean I have a good map of the field being discussed.
I have been, however, frustrated by how little actual advice I could derive from this map. The message I understood from most agendas was “here are the tractable sub-problems we will work on and why they should be solved”. I didn’t find much justification for why they are critically important, or why one particular set of problems should be explored instead of the next research group’s set.
So I looked harder. I found useful mapping work, yet nothing quite like what I was looking for. I also found related concerns in this post and this comment thread.
You’ll find, in the following sections, my (current) selection of:
“cartography work”, to draw a map of relevant arguments and concepts;
research agendas, from research groups or individuals;
points of entry for newcomers.
Here are the caveats. The list is not exhaustive. I did try to cover as many visible ideas as possible, and there will be significant overlap and cross-references between the items listed here. Some references I consider useful (e.g. this) have not made the cut. I attempted to categorize the resources by focus, but a handful could have ended up in a different category. Please don’t rely on it too much.
My comments aren’t summaries, but rather justifications for why I included each reference. They also liberally reuse the original formulations. Please tell me if I left strong misrepresentations of the ideas in there.
All these references, and countless comments scattered all across LessWrong, the Alignment forum, and the Effective Altruism forum, will hopefully help me build something actionable, something that would let newcomers and experts explore the field with more clarity and make better decisions.
My short-term plan is to create minimal interactive explanations for the relevance of various propositions in AI alignment, with the option to question and expand their premises. I want to do this for the first few high-level ideas, and if it goes well, expand to a first full scenario.
The long-term plan is to map as many propositions and available scenarios as possible, to have a common framework in which to compare research directions. My intuition (to be challenged) is that there’s broad agreement in the field on most premises I could describe, and that we would benefit a lot from locating cruxes (e.g. here). My overarching motivation is to reduce research debt.
The references here will be my first source of information. The second one would be discussions. If you are the author of one of the resources below and/or if you had more conversations about alignment-related arguments than you can remember, and want to share your insights, please reach out to me. I will do my best to answer in a timely manner.
Thanks to Adam Shimi, Alexis Carlier and Maxime Riché for reviewing drafts of this post and suggesting resources!
Argument mapping & reviews
Disentangling arguments for the importance of AI safety
Richard Ngo—January 2019
Splits the core motivating arguments for AI safety into six rough categories: maximizers being dangerous, target loading, prosaic alignment, human safety, misuses/vulnerabilities, and large impact.
Makes the case for more clarity around the fundamental ideas, analysis of the arguments, description of deployment scenarios, as well as making more explicit the assumptions behind research agendas.
Clarifying some key hypotheses in AI alignment
Ben Cottier, Rohin Shah—August 2019
Creates a diagram linking hypotheses, scenarios, agendas, and catastrophic problems. Selects for debated and important arguments, does not claim to be comprehensive, and links ideas through diverse relationships (support, conditional support, entailment, etc.).
The post itself goes into more details on the hypotheses, with resources listed for each one.
My personal cruxes for working on AI safety
Buck Shlegeris—January 2020
The first section of the talk highlights the limits of heuristic arguments, the usefulness of spelling out premises and making a deliberate effort to build compelling arguments for your personal stance.
The talk then proceeds to detail the speaker’s own argument for AI alignment work. Many commenters express their gratitude for all this exposition.
How sure are we about this AI stuff?
Ben Garfinkel—February 2019
Runs through the intuitive arguments behind AI risk prioritization: “AI as a big deal”, instability, lock-in, and accidents. Explains why each of them is not forceful as it stands, or is missing pieces or details.
Calls for fleshing out the arguments further, presenting this as a neglected issue with potentially high value.
A shift in arguments for AI risk
Tom Adamczewski—February 2019
Describes the evolution of AI risk arguments, from early descriptions of the alignment problem, to discontinuities as a premise for Bostrom’s Superintelligence, to alignment issues without discontinuity. Also describes non-alignment catastrophes, such as misuse risks.
Calls for clarification of arguments related to AI risk, especially on the subject of discontinuities, for better prioritization, and reduction of costly misunderstandings.
Scenarios, forecasting & strategy
AI Impacts (selected references)
AI Impacts contributors—Since 2014
The website in general is dedicated to building AI forecasting resources, to inform arguments and decisions. Some of their content most closely related to AI risk arguments:
Takeaways from safety by default interviews—April 2020
Evidence against current methods leading to human level artificial intelligence—August 2019
Likelihood of discontinuous progress around the development of AGI—February 2018
Paul Christiano—March 2019
Describes two scenarios for AI catastrophe which don’t depend on a fast surprise takeover by a powerful AI system. Also notable for the level of engagement in the comments.
Disjunctive Scenarios of Catastrophic AI Risk
Kaj Sotala—February 2018
Breaks down a wide range of scenarios leading to (at least) catastrophic risk, by decomposing them into a variety of factors: strategic advantage, takeoff speed, autonomy acquisition, plurality of agents, etc.
Explores the idea of there being multiple combinations of factors which may be realized, each of them leading to a catastrophe (as opposed to a specific privileged scenario, which may receive too much focus).
Wei Dai, Daniel Kokotajlo—March 2019 (last updated March 2020)
Thirty-two (and counting) high-level scenarios for AI catastrophe. Wei Dai emphasizes that they aren’t disjunctive, as some scenarios may subsume or cause others. Daniel Kokotajlo (who maintains and updates the list) suggests it could be refined, expanded and reorganized.
Chris Olah’s views on AGI safety
Evan Hubinger—November 2019
Reports arguments on the importance of transparency and interpretability, and about how to improve the field of machine learning to make progress on these issues.
Classification of global catastrophic risks connected with artificial intelligence
Alexey Turchin, David Denkenberger—January 2018
Lists and categorizes a wide range of catastrophic scenarios, from narrow or general AI, near-term or long-term, misuse or accidents, and many other factors, with references.
Agendas & reports focused on problem framing
Scott Garrabrant, Abram Demski—November 2018
Clarifies and motivates technical research stemming from the idea of embedded agents, where AI systems are no longer logically separated from their environment, implying modeling and self-modification issues, among others.
Describes the subproblems associated with that hypothesis: decision theory, embedded world-models, robust delegation, and subsystem alignment.
AI Governance: A Research Agenda
Allan Dafoe—August 2018
From the Center for the Governance of AI, Future of Humanity Institute. The agenda aims for superficial comprehensiveness, gathering as many questions relevant to AI Governance as possible in 53 pages, and providing extensive references for further details. It doesn’t focus on prioritization, nor tractability/impact estimates.
The questions are divided into three clusters: technical landscape (modeling and forecasting AI progress, mapping AI capabilities, and technical AI safety), AI politics (transformation of government, of the job market, and regulatory concerns), and ideal AI governance (desirable values, institutions and scenarios).
Building safe artificial intelligence: specification, robustness, and assurance
Pedro A. Ortega, Vishal Maini, DeepMind—September 2018
Motivates DeepMind’s technical AI safety research, dividing it into three areas: specification (how to define the purpose of a system, whether explicitly designed or emergent), robustness (how to prevent, anticipate, defend against, and recover from perturbations), and assurance (how to understand, evaluate and actively control the behavior of a system).
The post defines a broad array of technical terms. The challenges are grounded in problems already present in current AI systems, and in simple environments (gridworlds).
Concrete Problems in AI Safety
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané - June 2016
Describes and motivates five technical safety research problems in machine learning-based systems, tractable through direct experimentation, in toy environments and/or small-scale models. All problems, sub-problems, and proposed abstract solutions are grounded in the existing machine learning literature.
The authors also argue for the increasing relevance of these problems as AI capabilities progress.
Reframing Superintelligence: Comprehensive AI Services as General Intelligence
K. Eric Drexler—January 2019
Expands on Bostrom’s Superintelligence, through a mesh of forty high-level statements addressing the possibility of an intelligence explosion, the nature of advanced machine intelligence, the relationship between goals and intelligence, the use and control of advanced AI, and near/long-term considerations in AI safety & strategy.
The common underlying framing is a service-centered model of general intelligence, suggesting the integration of a diversity of task-oriented systems built incrementally, rather than mostly independent, self-improving superintelligent agents.
AI alignment reviews
AI Alignment Research Overview
Jacob Steinhardt—October 2019
Outlines four broad categories of technical work: technical alignment (how to create aligned AI), detecting failures (how to proactively check for alignment), methodological understanding (best practices), and system-building (how to do the previous three for large systems).
All problems (or sub-problems, for the first category) are explored through a high-level definition, motivation, solution desiderata, possible research avenues, personal takes, and references.
The Landscape of AI Safety and Beneficence Research
Richard Mallah - January 2017
Maps a large set of concepts and techniques in AI safety. The core content can be explored in this interactive visualization. The concepts are primarily organized through a hierarchical map, with secondary links for related ideas. All concepts are given high-level descriptions with references.
The stated purpose of the work is to provide a comprehensive map and a reference set of concepts for the field, to be extended through further research.
Rohin Shah—January 2020
The first section of the post is dedicated to recent work in basic AI risk analysis: new explorations of goal-directedness and comprehensive AI services, as well as new write-ups for or against AI risk (many of which are listed in this very document).
The rest of the post details recent work in the many sub-problems of AI alignment, noting that the over 300 references have been selected from a larger set of around 500 articles, clustered for readability (the reader shouldn’t take the chosen categorization as authoritative).
2019 AI Alignment Literature Review and Charity Comparison
Larks—December 2019
Sorts AI alignment work by origin, and not by topic. It highlights more specifically the agendas of the various research teams, and lists the collaborations between them. It also references a wide range of independent research.
In addition, the post details the funding of the various organizations involved in the field, as well as methodological comments on prioritization, funding, and research avenues.
Tom Everitt, Gary Lea, Marcus Hutter—May 2018
Focuses specifically on powerful AI systems: plausible conceptual models; forecasting of capability increase and risks; technical safety problems; design ideas and concepts; and public policy.
The paper explores safety problems shared by multiple research agendas, and summarizes a wide range of publications in the domain.
Introductory material
Benefits & Risks of Artificial Intelligence
Future of Life Institute—November 2015 (first version)
Summarizes in an accessible way the very high-level case for AI alignment research, the most common naive objections and misconceptions, with further reading references.
Superintelligence: Paths, Dangers, Strategies
Nick Bostrom—July 2014
Makes the case for the risk from superintelligent entities (not necessarily AI systems, though these are presented as the most probable origin). The book represents an early edited, long-form, philosophical introduction to numerous concepts such as the control problem, takeoff speeds, treacherous turn, instrumental convergence, decisive strategic advantage, value loading, and many more.
Human Compatible: Artificial Intelligence and the Problem of Control
Stuart Russell—October 2019
Makes the case for the risk from advanced AI systems through failure of alignment. The book describes the continued progress in AI capabilities, critically reviews the major arguments around AI risk and forecasting, and argues for early safety research, showcasing significant hurdles to overcome and possible research avenues.
Potential Risks from Advanced Artificial Intelligence: The Philanthropic Opportunity
Holden Karnofsky—May 2016
Makes the philanthropic case for AI risk research, describing three classes of risk: misuse risk (malevolent, or value-locking use of powerful technology), accident risk (stemming typically from alignment failure) and other risks (such as structural effects due to automation, or dissemination of increasingly capable tools). Also explains several principles for prioritization work.
Paul Christiano—June 2019
Decomposes the then-current main approaches in AI alignment research by building a tree diagram and giving friendly high-level explanations of the ideas. The exploration is itself biased towards iterated amplification, which is put in its broader context.
Many authors—From 2014 to 2018
Provides detailed explanations for many concepts in AI Alignment, in an explorable way. Now in an archived state.
Robert Miles’s YouTube channel
Robert Miles—Since 2014
Clear and friendly explanations of many concepts in AI alignment. For introductory material, it is best to start with his Computerphile videos, produced before the channel’s creation.
Technical agendas focused on possible solutions
Paul Christiano—October 2018
Describes iterated amplification, an alignment technique for powerful ML-based systems. Spells out the core hypotheses behind the validity of the techniques. In the fourth section, details the associated research directions, and desiderata for AI alignment research.
Rohin Shah, Paul Christiano, Stuart Armstrong, Jacob Steinhardt, Owain Evans—October 2018
Investigates and motivates value learning, discussing the arguments stemming from the idea of a powerful AI system pursuing a particular utility function, using human behavior as a data source. Clearly restates the core arguments in the conclusion post.
Alex Turner—July 2019
Explores and motivates new ways to work with impact measures, a common component of various approaches in AI safety research, and how to think about scenarios where a powerful AI system makes wide-ranging decisions and actions.
Research Agenda v0.9: Synthesising a human’s preferences into a utility function
Stuart Armstrong—June 2019
Clarifies and motivates a technical agenda for building specific assumptions into AI systems that would let them infer human preferences, as an instrumental goal for aligning onto them.
Deconfusing Human Values Research Agenda v1
G Gordon Worley III—March 2020
Defines a technical agenda for building a formal expression of the structure of human values, modeling them as the input of their decision process.
The Learning-Theoretic AI Alignment Research Agenda
Vanessa Kosoy—July 2018
Details and motivates philosophically a technical agenda to ground AI alignment in statistical and computational learning theory, as well as algorithmic information theory.
Scalable agent alignment via reward modeling: a research direction
Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg—November 2018
DeepMind paper, which defines a more specific agenda around the learning of a reward function through reinforcement learning, from interaction with a human user, in a way that scales to complex and general domains.
AI Safety Needs Social Scientists
Geoffrey Irving, Amanda Askell—February 2019
Explores and motivates the debate approach to alignment, learning human values through experiments, asking questions and arbitrating between arguments. Tied to the AI safety via debate OpenAI paper.
Special mentions
Technical AGI safety research outside AI
Richard Ngo—October 2019
The first section, “Studying and understanding safety problems”, motivates this very project. The entire post is full of interesting problems to solve.
Victoria Krakovna—Regularly updated since August 2017
Provides a wealth of useful references, which significantly helped expand this list. Still receiving updates!
To reiterate, just above the comment box: I’m looking for insights. If your favorite reference is missing; if you spot a glaring error; if you have a strong opinion on research directions; if you share my frustrations, or disagree: do share! (Yes, the post is long; please don’t let that stop you from engaging.)
I know you link/mention Rohin’s map. I think Paul or Chris Olah had put together another map at one time. How do you see your work differing from or building on what they’ve done?
Is Paul’s map the one in Current Work in AI Alignment? I think Rohin also used it in his online-EAG 2020 presentation. For Rohin’s map, are you referring to Ben Cottier’s Clarifying some key hypotheses in AI alignment, to which Rohin made major contributions? I’ll be referring to those two in the rest of my answer.
I want to make more explicit the relationships between the premises and outcomes included in the diagrams. The goal of my work is to make those kinds of questions easier to answer:
Are scenarios X and Y mutually exclusive? If they are, is the split sharp (is there a premise P which prevents X if true, and prevents Y if false)?
What are the premises behind the work on a specific problem? Which events or results would make this work irrelevant?
Does it make sense to “partially solve” problem P? Are there efforts which won’t make any difference until something specific happens?
I find it hard to answer those questions with the diagrams, since (from my understanding) they have other goals entirely. Paul’s map shows how current research questions relate to each other, with closer elements in the tree sharing more concepts and techniques. Ben & Rohin’s map shows which questions are controversial, which debates feed into others, and which very broad scenarios/agendas are relevant to them.
You can answer the questions listed above by integrating the diagram with the post details, and following references… but it isn’t convenient. I want to make it easier to discover and engage with that knowledge.
The main difference between my (future) work and the diagrams would be to enable the user to explore one specific scenario/research question at a time. For example, in Paul’s talk, that would mean starting from “iterated amplification” and repeatedly asking “why?” as you go up the tree. I want the user to find out what happens if one of the premises doesn’t hold: is the work still useful? If we want to maintain the premise, what are the load-bearing sub-premises?
I expect a lot of the structure in the diagrams will be mirrored in the end result anyway, as it should, since it’s the same knowledge. I hope to distill it in a different way.
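To make the “repeatedly asking why” idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration under assumptions, not a design for the actual tool: the `Node` class, the function names, and the placeholder goals and premises are all hypothetical, and only the “iterated amplification” label comes from the example above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    """One research question or goal in a hypothetical argument map."""
    name: str
    parent: Node | None = None              # the goal this node serves (the answer to "why?")
    premises: frozenset[str] = frozenset()  # assumptions this step relies on


def why_chain(node: Node) -> list[str]:
    """Walk up the tree from `node`, asking "why?" at each step."""
    names = []
    current = node
    while current is not None:
        names.append(current.name)
        current = current.parent
    return names


def premises_relied_on(node: Node) -> set[str]:
    """Collect every premise used by `node` or by any goal above it."""
    collected = set()
    current = node
    while current is not None:
        collected |= current.premises
        current = current.parent
    return collected


def still_motivated(node: Node, rejected_premise: str) -> bool:
    """Return False if the motivation for `node` depends on the rejected premise."""
    return rejected_premise not in premises_relied_on(node)


# Placeholder map: the goal names and premises are invented for illustration.
top = Node("placeholder top-level goal")
middle = Node("placeholder intermediate goal", parent=top,
              premises=frozenset({"premise P"}))
leaf = Node("iterated amplification", parent=middle,
            premises=frozenset({"premise Q"}))

print(why_chain(leaf))
print(still_motivated(leaf, "premise P"))
```

In this toy version, rejecting “premise P” removes the motivation for the leaf but not for the top-level goal. A real map would need richer relationships than a single parent link (conditional support, entailment, and so on).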
Thanks, that’s really helpful for understanding your work better!
I am running a large-scale version of this, with contributors from multiple organizations. We should definitely discuss. Can you message me or email me? Aryeh.Englander at jhuapl.edu. Thanks!
We should indeed! I just sent you an email.
I don’t know whether this is on purpose, but I’d think that AI Safety Via Debate (original paper: https://arxiv.org/abs/1805.00899; recent progress report: https://www.lesswrong.com/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1) should get a mention, probably in the Technical agendas focused on possible solutions section? I’d argue it’s different enough from IDA to have its own subtitle.
It was in the references that initially didn’t make the cut. After further thought, it’s indeed worth adding. I referenced the Distill article AI Safety Needs Social Scientists, which spends more time on the motivating arguments, and linked to the paper in the note.
Thanks for your feedback!
No worries. Although I think less has been written on debate than on amplification (Paul has a lot of blog posts on IDA), it seems to me that most of the work Paul’s team at OpenAI is doing is on debate rather than IDA.