AI Safety and Neighboring Communities: A Quick-Start Guide, as of Summer 2022
Getting into AI safety involves working with a mix of communities, subcultures, goals, and ideologies that you may not have encountered in the context of mainstream AI technical research. This document attempts to briefly map these out for newcomers.
This is inevitably going to be biased by what sides of these communities I (Sam) have encountered, and it will quickly become dated. I expect it will still be a useful resource for some people anyhow, at least in the short term.
AI Safety/AI Alignment/AGI Safety/AI Existential Safety/AI X-Risk
The research project of ensuring that future AI progress doesn’t yield civilization-endingly catastrophic results.
Good intros:
Carlsmith Report
What misalignment looks like as capabilities scale
Vox piece
Why are people concerned about this?
My rough summary:
It’s plausible that future AI systems could be much faster or more effective than us at real-world reasoning and planning.
Probably not plain generative models, but possibly models derived from generative models in cheap ways.
Once you have a system with superhuman reasoning and planning abilities, it’s easy to make it dangerous by accident.
Most simple objective functions or goals become dangerous in the limit, usually because of secondary or instrumental subgoals that emerge along the way.
Pursuing typical goals arbitrarily well requires a system to prevent itself from being turned off, by deception or force if needed.
Pursuing typical goals arbitrarily well requires acquiring any power or resources that could increase the chances of success, by deception or force if needed.
Toy example: Computing pi to an arbitrarily high precision eventually requires that you spend all the sun’s energy output on computing. (There’s a short code-flavored version of this point in the sketch after this list.)
Knowledge and values are likely to be orthogonal: A model could know human values and norms well, but not have any reason to act on them. For agents built around generative models, this is the default outcome.
Sufficiently powerful AI systems could look benign in pre-deployment training/research environments, because they would be capable of understanding that they’re not yet in a position to accomplish their goals.
Simple attempts to work around this (like the more abstract goal ‘do what your operators want’) don’t tend to have straightforward robust implementations.
If such a system were single-mindedly pursuing a dangerous goal, we probably wouldn’t be able to stop it.
Superhuman reasoning and planning would give models with a sufficiently good understanding of the world many ways to effectively gain power with nothing more than an internet connection (e.g., cyberattacks on banks).
Consensus within the field is that these risks could become concrete within ~4–25 years, and have a >10% chance of leading to a global catastrophe (i.e., extinction or something comparably bad). If true, this is very bad news.
Given the above, we either need to stop all development toward AGI worldwide (plausibly undesirable or impossible), or else do three possible-but-very-difficult things:
(i) build robust techniques to align AGI systems with the values and goals of their operators,
(ii) ensure that those techniques are understood and used by any group that could plausibly build AGI, and
(iii) ensure that we’re able to govern the operators of AGI systems in a way that makes their actions broadly positive for humanity as a whole.
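To make the instrumental-subgoal point above concrete, here’s a minimal toy sketch in Python (my own illustration, with made-up plan names and numbers, not something taken from the sources above): a planner that scores candidate plans only by how much of a simple objective it expects to achieve ends up preferring the plan that grabs extra resources and avoids shutdown, even though the objective never mentions either.

    # Minimal illustrative sketch: an expected-value planner with a simple objective.
    # All plan names and numbers are made up for illustration.

    # Each plan: (name, compute acquired, probability of being shut down early)
    PLANS = [
        ("just compute quietly",                 1.0, 0.10),
        ("acquire more cloud compute first",     5.0, 0.10),
        ("disable the off switch, then compute", 1.0, 0.01),
        ("do both",                              5.0, 0.01),
    ]

    def expected_progress(compute, p_shutdown):
        # Stand-in objective: more compute and a lower chance of being shut down
        # both mean more digits of pi computed in expectation.
        return compute * (1.0 - p_shutdown)

    # The planner simply picks the plan with the highest expected objective value.
    best = max(PLANS, key=lambda plan: expected_progress(plan[1], plan[2]))
    print("Chosen plan:", best[0])  # -> "do both"

Nothing in the objective says “don’t resist shutdown” or “don’t grab resources,” so nothing rules those plans out; real systems and objectives are vastly more complicated, but this is the basic shape of the worry.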
Does this have anything to do with sentience or consciousness?
No.
Influential people and institutions:
Present core community as I see it: Paul Christiano, Jacob Steinhardt, Ajeya Cotra, Jared Kaplan, Jan Leike, Beth Barnes, Geoffrey Irving, Buck Shlegeris, David Krueger, Chris Olah, Evan Hubinger, Richard Ngo, Rohin Shah; ARC, Redwood Research, DeepMind, OpenAI, Anthropic, UC Berkeley’s CHAI, OpenPhil
nb: This list is especially likely to be subjective and contentious
Early/Pre Deep-Learning Revolution: Stuart Russell, Nick Bostrom, Eliezer Yudkowsky; MIRI, Future of Humanity Institute
Key fora:
Alignment Forum, arXiv, lots of private Slacks, informal events in Berkeley and Oxford
Terminological note:
AI Safety is the most common term for this project and this research community, but it’s also extremely vague, which is why you’ll often see the other terms thrown in as well. Lots of “AI Safety” projects/funders (example) have little to do with the issues described here.
Effective Altruism/EA
The research project and social movement of doing as much good as possible with limited resources.
What’s the connection to AI Safety?
EA researchers are unusually interested in potential global catastrophes, as they’re often quite neglected relative to how much of an impact we can have on them (many EAs are smug about the fact that EA was one of the biggest sources of funding/lobbying for pandemic preparedness before COVID).
This has meant that EA researchers and organizations have directed a lot of attention toward AI risks, and EA-sympathetic people with strong CS skills tend to gravitate toward AI risk research.
Influential people and institutions:
Will MacAskill, Habiba Islam, Rob Wiblin, Toby Ord, Holden Karnofsky, Sam Bankman-Fried, Peter Singer (historically); GiveWell, OpenPhil, Center for Effective Altruism, Giving What We Can, 80,000 Hours, FTX Future Fund
A misconception you’ll sometimes see: that Elon Musk and Peter Thiel are major figures in EA. Both were briefly adjacent to the movement a decade ago, but neither was especially active, and neither has funded or participated in any significant EA projects since then. Views on them within EA are mixed.
Key fora:
EA Forum, EA Global (EAG) conferences, small invite-only research workshops
Longtermism
The ethical principle that the consequences of our actions on other people matter equally wherever and whenever those consequences are felt. Because our choices could influence a potentially huge number of future people, this implies that our impact on the longer-term future should be a central consideration in ethical decision-making.
Good intro:
NYTimes Op-Ed, Time Magazine feature
What’s the connection to AI Safety?
It’s not related in any deep way: under typical assumptions about AI risk, an AI catastrophe would affect a huge number of people who are alive now, so the risk is a huge deal under almost any reasonable ethical framework.
…but longtermism is a disproportionately common view among EAs and AI safety people, and it has gotten a fair amount of press, so you’ll hear about it.
It also tends to have some of the weirdest optics of all of these communities, since much of the funding for explicitly longtermist projects comes from EA crypto billionaire Sam Bankman-Fried.
Influential people and institutions:
Will MacAskill, Toby Ord, Nick Beckstead, OpenPhil, 80,000 Hours, FTX Future Fund
The Rationalist Subculture/The LessWrong Crowd/Berkeley-Style Rationalism/The Rats
A distinctive social group focused on using reason and science as thoroughly and deeply as possible in everyday life and important life decisions.
Good intro:
The Sequences (featured early posts on LessWrong)
What’s the connection to AI Safety?
Early AI safety writer Eliezer Yudkowsky was also one of the originators of the rationalist subculture, and the early rationalist organization Center for Applied Rationality was known for running skill-building workshops that pushed participants hard to work on AI risks. So, a decent number of long-time AI safety researchers got involved by way of rationalism.
EA was influenced by the rationalists early on, so there’s an indirect connection there.
Signature features:
Emphasis on blunt communication and asking for things rather than hinting at them
Emphasis on using explicit probabilities and bets in everyday situations
Interest in experimental lifestyle choices, like polyamory, Soylent/Huel-style meal replacements, experimental sleep schedules, nootropics
Influential people and institutions:
Eliezer Yudkowsky, The Astral Codex Ten (formerly Slate Star Codex/SSC) blog
Key fora:
LessWrong
Related keywords/communities:
Skepticism, humanism, new atheism
AGI Optimism
The view that building (aligned) AGI will lead to a post-scarcity, galaxy-spanning, pluralist utopia and would be humanity’s greatest achievement.
What’s the connection to AI safety?
This view is somewhat common within AI safety. It’s not incompatible with being concerned about misaligned AGI, and many prominent AGI optimists are concerned about alignment, but there is an obvious tension between the two ideas.
This view was influential in the creation of OpenAI and DeepMind, and is responsible for a good deal of tension within the leadership of both organizations.
Side note:
This has some overlap with transhumanism.
Key people:
Sam Altman, Demis Hassabis, Elon Musk
AI Ethics/Responsible AI/The FAccT Community
The research and political project of minimizing the harms of current and near-future AI/ML technology and of ensuring that any benefits from such technology are shared broadly.
What’s the connection to AI Safety?
A small but non-trivial minority of the questions overlap, and a lot of technical work in this community deals with alignment issues.
Culturally, though, there’s some tension between this community and the AI safety/EA communities: the FAccT community tends to emphasize political work and related norm-setting, is generally part of the political left, and is often openly adversarial toward big tech, while AI safety (and the related AI governance community; see below) tends to steer clear of potentially contentious political topics and emphasizes coalition-building.
This difference in strategy isn’t an accident: when the concern is a global catastrophic risk, making political progress in only the US or EU doesn’t solve the problem, since labs anywhere else in the world could still be doing dangerous work, and it risks differentially harming the most careful labs. For most present-day concerns, by contrast, US/EU tech regulation really can make a big difference.
(Long-Term) AI Governance
The project of developing institutions and policies within present-day governments to help increase the chances that AI progress goes well.
What’s the connection to AI safety?
This is arguably a subset of AI safety work, but involves very different skill sets. It’s also much less culturally weird, since it involves close collaboration with government officials.
Unlike in short-term AI policy work, a popular view here is that we don’t yet understand the problem well enough for sweeping regulation to be productive.
Emphasis is on building awareness and expertise within governments, preventing panicky arms-race dynamics from emerging out of military AI efforts, and making smaller policy changes to encourage safety research.
Key people and institutions:
Jade Leung, Helen Toner, Allan Dafoe, Jack Clark, Center for the Study of Existential Risk, GovAI
Acknowledgments
Thanks to Alex Tamkin, Jared Kaplan, Neel Nanda, Leo Gao, Fazl Barez, Owain Evans, Beth Barnes, and Rohin Shah for comments on a previous version of this.