AI Safety “Success Stories”
AI safety researchers often describe their long-term goals as building “safe and efficient AIs”, but don’t always mean the same thing by this or other seemingly similar phrases. Asking about their “success stories” (i.e., scenarios in which their line of research helps contribute to a positive outcome) can help make clear what their actual research aims are. Knowing such scenarios also makes it easier to compare the ambition, difficulty, and other attributes of different lines of AI safety research. I hope this contributes to improved communication and coordination between different groups of people working on AI risk.
In the rest of the post, I describe some common AI safety success stories that I’ve heard over the years and then compare them along a number of dimensions. They are listed in roughly the order in which they first came to my attention. (Suggestions welcome for better names for any of these scenarios, as well as additional success stories and additional dimensions along which they can be compared.)
The Success Stories
Sovereign Singleton
AKA Friendly AI, an autonomous, superhumanly intelligent AGI that takes over the world and optimizes it according to some (perhaps indirect) specification of human values.
Pivotal Tool
An oracle or task AGI, which can be used to perform a pivotal but limited act, and then stops to wait for further instructions.
Corrigible Contender
A semi-autonomous AGI that does not have long-term preferences of its own but acts according to (its understanding of) the short-term preferences of some human or group of humans. It competes effectively with comparable AGIs corrigible to other users, as well as with unaligned AGIs (if any exist), for resources and ultimately for influence on the future of the universe.
Interim Quality-of-Life Improver
AI risk can be minimized if world powers coordinate to limit AI capabilities development or deployment, in order to give AI safety researchers more time to figure out how to build a very safe and highly capable AGI. While that is proceeding, it may be a good idea (e.g., politically advisable and/or morally correct) to deploy relatively safe, limited AIs that can improve people’s quality of life but are not necessarily state of the art in terms of capability or efficiency. Such improvements could include, for example, curing diseases and solving pressing scientific and technological problems.
(I want to credit Rohin Shah as the person that I got this success story from, but can’t find the post or comment where he talked about it. Was it someone else?)
Research Assistant
If an AGI project gains a lead over its competitors, it may be able to grow that into a larger lead by building AIs to help with (either safety or capability) research. This can be in the form of an oracle, or human imitation, or even narrow AIs useful for making money (which can be used to buy more compute, hire more human researchers, etc). Such Research Assistant AIs can help pave the way to one of the other, more definitive success stories. Examples: 1, 2.
Comparison Table
| | Sovereign Singleton | Pivotal Tool | Corrigible Contender | Interim Quality-of-Life Improver | Research Assistant |
|---|---|---|---|---|---|
| Autonomy | High | Low | Medium | Low | Low |
| AI safety ambition / difficulty | Very High | Medium | High | Low | Low |
| Reliance on human safety | Low | High | High | Medium | Medium |
| Required capability advantage over competing agents | High | High | None | None | Low |
| Tolerates capability trade-off due to safety measures | Yes | Yes | No | Yes | Some |
| Assumes strong global coordination | No | No | No | Yes | No |
| Controlled access | Yes | Yes | No | Yes | Yes |
(Note that due to limited space, I’ve left out a couple of scenarios which are straightforward recombinations of the above success stories, namely Sovereign Contender and Corrigible Singleton. I also left out CAIS because I find it hard to visualize it clearly enough as a success story to fill out its entries in the above table, plus I’m not sure if any safety researchers are currently aiming for it as a success story.)
The color coding in the table indicates how hard it would be to achieve the required condition for a success story to come to pass, with green meaning relatively easy, and yellow/pink/violet indicating increasing difficulty. Below is an explanation of what each row heading means, in case it’s not immediately clear.
Autonomy
The opposite of human-in-the-loop.
AI safety ambition/difficulty
Achieving each success story requires solving a different set of AI safety problems. This is my subjective estimate of how ambitious/difficult the corresponding set of AI safety problems is. (Please feel free to disagree in the comments!)
Reliance on human safety
How much does achieving this success story depend on humans being safe, or on solving human safety problems? This is also a subjective judgement because different success stories rely on different aspects of human safety.
Required capability advantage over competing agents
Does achieving this success story require that the safe/aligned AI have a capability advantage over other agents in the world?
Tolerates capability trade-off due to safety measures
Many ways of achieving AI safety have a cost in terms of lowering the capability of an AI relative to an unaligned AI built using comparable resources and technology. In some scenarios this cost is less consequential (e.g., because the scenario depends on achieving a large initial capability lead and then preventing any subsequent competitors from arising), and that’s indicated by a “Yes” in this row.
Assumes strong global coordination
Does this success story assume that there is strong global coordination to prevent unaligned competitors from arising?
Controlled access
Does this success story assume that only a small number of people are given access to the safe/aligned AI?
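For readers who want to query the comparison table rather than scan it, here is a minimal sketch in Python. The row keys, the `TABLE` layout, and the `stories_where` helper are all hypothetical names introduced for illustration; the cell values are copied from the table above and carry the same subjective judgements.

```python
# Hypothetical encoding of the comparison table above.
# Row keys are shorthand invented for this sketch, not terms from the post.
ROWS = [
    "autonomy",
    "safety_difficulty",
    "reliance_on_human_safety",
    "required_capability_advantage",
    "tolerates_capability_tradeoff",
    "assumes_global_coordination",
    "controlled_access",
]

# One list per success story, in the same order as ROWS.
TABLE = {
    "Sovereign Singleton": ["High", "Very High", "Low", "High", "Yes", "No", "Yes"],
    "Pivotal Tool": ["Low", "Medium", "High", "High", "Yes", "No", "Yes"],
    "Corrigible Contender": ["Medium", "High", "High", "None", "No", "No", "No"],
    "Interim Quality-of-Life Improver": ["Low", "Low", "Medium", "None", "Yes", "Yes", "Yes"],
    "Research Assistant": ["Low", "Low", "Medium", "Low", "Some", "No", "Yes"],
}


def stories_where(**conditions):
    """Return the success stories whose entries match every row=value condition."""
    matches = []
    for name, values in TABLE.items():
        entry = dict(zip(ROWS, values))
        if all(entry[row] == value for row, value in conditions.items()):
            matches.append(name)
    return matches


# Stories that need a high capability advantage over competing agents.
needs_lead = stories_where(required_capability_advantage="High")

# Stories with low safety difficulty that do not assume global coordination.
easy_without_coordination = stories_where(
    safety_difficulty="Low", assumes_global_coordination="No"
)
```

Under this encoding, the first query returns Sovereign Singleton and Pivotal Tool, both of which also have “Yes” in the trade-off row, and the second returns only Research Assistant.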
Further Thoughts
This exercise made me realize that I’m confused about how the Pivotal Tool scenario is supposed to work, after the initial pivotal act is done. It would likely require several years or decades to fully solve AI safety/alignment and remove the dependence on human safety, but it’s not clear how to create a safe environment for doing that after the pivotal act.
One thing I’m less confused about now is why people who work toward the Contender scenarios are focused more on minimizing the capability trade-off of safety measures than people who work toward the Singleton scenarios, even though the latter scenarios seem to demand more of a capability lead. It’s because the latter group of people think it’s possible or likely for a single AGI project to achieve a large initial capability advantage, in which case some initial capability trade-off due to safety measures is OK, and subsequent ongoing capability trade-off is not consequential because there would be no competitors left.
The comparison table makes Research Assistant seem a particularly attractive scenario to aim for, as a stepping stone to a more definitive success story. Is this conclusion actually justified?
Interim Quality-of-Life Improver also looks very attractive, if only strong global coordination could be achieved.
I’m surprised this post didn’t get more comments and spark more research. Rereading it, I think it’s both an excellent overview/distillation and a piece of strategy research in its own right. I wish there were more things like this. I think this post deserves to be expanded into a book or website and continually updated and refined.
Seems like a good starting point for discussion. Researchers need to have some picture of what AI alignment is “for,” in order to think about what research directions look most promising.