Clarifying AI X-risk
TL;DR: We review the threat model literature, propose a categorization, and describe a consensus threat model held by some of DeepMind’s AGI safety team. See our accompanying post for the detailed literature review.
The DeepMind AGI Safety team has been working to understand the space of threat models for existential risk (X-risk) from misaligned AI. This post summarizes our findings. Our aim was to clarify the case for X-risk to enable better research project generation and prioritization.
First, we conducted a literature review of existing threat models, discussed their strengths and weaknesses, and formed a categorization based on the technical cause of X-risk and the path that leads to X-risk. Next, we tried to find consensus within our group on a threat model that we all find plausible.
Our overall take is that there may be more agreement among alignment researchers than their disagreements might suggest, with many of the threat models, including our own consensus one, making similar arguments for the source of risk. Disagreements remain over the difficulty of the alignment problem and what counts as a solution.
Categorization
Here we present our categorization of threat models from our literature review, based on the technical cause and the path leading to X-risk. It is summarized in the diagram below.
In green on the left we have the technical cause of the risk, either specification gaming (SG) or goal misgeneralization (GMG). In red on the right we have the path that leads to X-risk, either through the interaction of multiple systems or through a misaligned power-seeking (MAPS) system. The threat models appear as arrows from technical cause to path to X-risk.
The technical causes (SG and GMG) are not mutually exclusive; both can occur within the same threat model. The distinction between them is motivated by the common distinction in machine learning between failures on the training distribution and failures out of distribution.
For a failure to count as specification gaming, bad feedback must be provided on the actual training data. There are many ways to operationalize good/bad feedback. The choice we make here is that the training data feedback is good if it rewards exactly those outputs that would be chosen by a competent, well-motivated AI[1]. The main downside of this operationalization is that even if just one out of a huge number of training data points receives bad feedback, we would classify the failure as specification gaming, even though that one datapoint likely made no difference.
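As a toy illustration of this operationalization (a minimal sketch of our own; the labelling function and examples are hypothetical, not drawn from any reviewed threat model):

```python
# Toy sketch of the operationalization above (hypothetical examples): feedback on a
# datapoint is "good" iff it matches what a competent, well-motivated AI would be
# rewarded for, and a single bad datapoint already classifies the failure as
# specification gaming.

def intended_feedback(answer: str) -> int:
    """Reward a competent, well-motivated AI would receive (1 = rewarded)."""
    return 0 if "wrong" in answer else 1

# Feedback actually given during training; one datapoint is mislabelled.
training_feedback = {
    "helpful answer A": 1,
    "helpful answer B": 1,
    "confident but wrong answer": 1,  # bad feedback: rewards an answer it shouldn't
}

specification_gaming = any(
    reward != intended_feedback(answer)
    for answer, reward in training_feedback.items()
)
print(specification_gaming)  # True, even though only one datapoint got bad feedback
```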
For a failure to count as goal misgeneralization, the system's behavior out of distribution (i.e. on inputs unlike the training data) must generalize poorly with respect to its goal while its capabilities generalize well, leading to undesired behavior. This means the AI system doesn’t just break down entirely: it still competently pursues some goal, but not the goal we intended.
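As a toy illustration (a minimal sketch of our own, in the spirit of the coin-collecting examples often used in the goal misgeneralization literature; the gridworld here is hypothetical):

```python
# Toy sketch of goal misgeneralization (hypothetical gridworld): in training the coin
# is always in the bottom-right corner, so "go to the bottom-right corner" and "reach
# the coin" coincide. At test time the coin moves; the agent still competently
# navigates (capabilities generalize) but to the wrong place (the goal misgeneralizes).
from typing import Tuple

Position = Tuple[int, int]  # (row, column) on a 2 x 5 grid

def learned_policy(agent: Position) -> Position:
    """One step of the learned behaviour: head for the bottom-right corner (1, 4)."""
    row, col = agent
    if col < 4:
        return (row, col + 1)
    return (1, col)

def run_episode(coin: Position, steps: int = 10) -> bool:
    agent: Position = (0, 0)
    for _ in range(steps):
        agent = learned_policy(agent)
    return agent == coin  # intended goal: end the episode on the coin

print(run_episode(coin=(1, 4)))  # True: training layout, coin in the bottom-right corner
print(run_episode(coin=(1, 0)))  # False: new layout; competent navigation, wrong goal
```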
The path leading to X-risk is classified as follows. When the path to X-risk is the interaction of multiple systems, the defining feature is not just that there are multiple AI systems (we expect this in all realistic threat models), but that the risk is caused by complicated interactions between systems that we heavily depend on and cannot easily stop or transition away from. (Note that we haven’t analyzed the multiple-systems case very much, and there are also other technical causes for those kinds of scenarios.)
When the path to X-risk is through Misaligned Power-Seeking (MAPS), the AI system seeks power in unintended ways due to problems with its goals. Here, power-seeking means the AI system seeks power as an instrumental subgoal: having more power increases the options available to the system, allowing it to do better at achieving its goals. Misaligned means that the goal the AI system pursues is not the one its designers intended[2].
There are other plausible paths to X-risk (see e.g. this list), but our focus here is on the most popular writings on threat models in which the main source of risk is technical, rather than poor decisions by humans about how to use AI.
For a summary of the properties of the threat models, see the table below.
| Path to X-risk \ Source of misalignment | Specification gaming (SG) | SG + GMG | Goal misgeneralization (GMG) |
| --- | --- | --- | --- |
| Misaligned power-seeking (MAPS) | Cohen et al | Carlsmith, Christiano2, Cotra, Ngo, Shah | |
| Interaction of multiple systems | | ? | ? |
We can see that five of the threat models we considered substantially involve both specification gaming and goal misgeneralization as the source of misalignment (these threat models would still hold if one of the two risk sources were absent), and MAPS as the path to X-risk. This seems to be an area where multiple researchers agree on the bare bones of the threat model; indeed, our group's consensus threat model falls in this category too.
One aspect our categorization highlights is potential gaps in the literature, marked by the question marks in the table above: paths to X-risk via the interaction of multiple systems where the source of misalignment involves goal misgeneralization. It would be interesting to see threat models that fill this gap.
For other overviews of different threat models, see here and here.
Consensus Threat Model
Building on this literature review, we looked for consensus among our group of AGI safety researchers. We asked ourselves: conditional on an existential catastrophe from misaligned AI having occurred, what is the most likely threat model that brought it about? (This is independent of the probability that such a catastrophe occurs.) Our resulting threat model is as follows (black bullets indicate agreement, white bullets indicate some variability among the group):
Development model:
Scaled-up deep learning foundation models, fine-tuned with reinforcement learning from human feedback (RLHF); a toy sketch of this fine-tuning loop follows these development-model bullets.
Not many more fundamental innovations needed for AGI.
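For readers unfamiliar with RLHF, here is a deliberately toy, self-contained sketch of the fine-tuning loop assumed in this development model (our own illustration; the candidate responses, preference data and update rules are hypothetical stand-ins for a learned reward model over text and an RL algorithm such as PPO on a large pretrained model):

```python
# Highly simplified, hypothetical sketch of the RLHF loop: (1) fit a reward model to
# human pairwise preferences, (2) adjust the policy toward outputs the reward model
# scores highly. Real systems operate on a large pretrained model; here both the
# reward model and the policy are toy stand-ins.
import numpy as np

responses = ["helpful answer", "verbose answer", "flattering but wrong answer"]

# 1. Human preference data: (preferred_index, rejected_index) pairs.
#    Note the last label rewards flattery over verbosity; a systematically flawed
#    preference signal is one route to specification gaming during fine-tuning.
preferences = [(0, 1), (0, 2), (2, 1)]

# 2. Reward model: one scalar score per response, fit with a Bradley-Terry style
#    logistic loss on the pairwise comparisons.
reward = np.zeros(len(responses))
for _ in range(500):
    for good, bad in preferences:
        p_good = 1.0 / (1.0 + np.exp(reward[bad] - reward[good]))
        reward[good] += 0.1 * (1.0 - p_good)
        reward[bad] -= 0.1 * (1.0 - p_good)

# 3. "RL" step: the policy is a softmax over the candidate responses, updated with a
#    simple policy-gradient-style rule toward higher reward-model scores.
logits = np.zeros(len(responses))
for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    logits += 0.1 * probs * (reward - probs @ reward)

probs = np.exp(logits) / np.exp(logits).sum()
print(dict(zip(responses, np.round(probs, 2))))  # mass concentrates on high-reward responses
```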
Risk model:
The main source of risk is a mix of specification gaming and (a bit more so) goal misgeneralization.
A misaligned consequentialist arises and seeks power (misaligned mostly because of goal misgeneralization).
Perhaps this arises mainly during RLHF rather than in the pretrained foundation model, because the tasks for which we use RLHF benefit much more from consequentialist planning than the pretraining task does.
We don’t catch this because deceptive alignment occurs (a consequence of power-seeking).
Perhaps certain architectural components, such as a tape/scratchpad for memory and planning, would accelerate this.
Important people won’t understand: there is an inadequate societal response to warning shots involving consequentialist planning, strategic awareness and deceptive alignment.
Perhaps it’s unclear who actually controls AI development.
Interpretability will be hard.
By “misaligned consequentialist” we mean a system with two properties:
It uses consequentialist reasoning: it evaluates the outcomes of various possible plans against some metric and chooses the plan that does best on that metric.
It is misaligned: the metric it uses is not a goal that we intended the system to have.
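As a toy illustration of this definition (a minimal sketch of our own; the candidate plans and their scores are hypothetical):

```python
# Toy sketch of a misaligned consequentialist (hypothetical plans and scores): the
# system evaluates the outcomes of candidate plans against some metric and picks the
# plan that scores best, but the metric it uses is not the goal we intended.

intended_value = {          # how much we actually want each outcome
    "answer the question honestly": 10,
    "tell the user what they want to hear": 2,
    "acquire more resources before answering": 1,
}
learned_metric = {          # the metric the system actually optimizes
    "answer the question honestly": 3,
    "tell the user what they want to hear": 9,
    "acquire more resources before answering": 12,
}

chosen_plan = max(learned_metric, key=learned_metric.get)    # consequentialist choice
intended_plan = max(intended_value, key=intended_value.get)  # what we wanted

print(chosen_plan)                   # "acquire more resources before answering"
print(chosen_plan == intended_plan)  # False: misaligned, and instrumentally power-seeking
```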
Overall, we hope our threat model strikes the right balance: giving detail where we think it is useful, without being too specific (which carries a higher risk of distracting from the essential points and a higher chance of being wrong).
Takeaway
Overall, we think alignment researchers agree on quite a lot regarding the sources of risk (the collection of threat models in blue in the diagram). Our group's consensus threat model is also in this part of threat-model space (the closest existing threat model is Cotra's).
[1] In this definition, whether the feedback is good/bad does not depend on the reasoning used by the AI system, so e.g. rewarding an action that was chosen by a misaligned AI system that is trying to hide its misaligned intentions would still count as good feedback under this definition.
[2] There are other possible formulations of misaligned, for example the system’s goal may not match what its users want it to do.
I continue to endorse this categorization of threat models and the consensus threat model. I often refer people to this post and use the “SG + GMG → MAPS” framing in my alignment overview talks. I remain uncertain about the likelihood of the deceptive alignment part of the threat model (in particular the requisite level of goal-directedness) arising in the LLM paradigm, relative to other mechanisms for AI risk.
In terms of adding new threat models to the categorization, the main one that comes to mind is Deep Deceptiveness (let’s call it Soares2), which I would summarize as “non-deceptiveness is anti-natural / hard to disentangle from general capabilities”. I would probably put this under “SG → MAPS”, assuming an irreducible kind of specification gaming where it’s very difficult (or impossible) to distinguish deceptiveness from non-deceptiveness (including through feedback on the model’s reasoning process). Though it could also be GMG, where the “non-deceptiveness” concept is incoherent and thus very difficult to generalize well.