[Linkpost] Some high-level thoughts on the DeepMind alignment team’s strategy
Update: The original title “DeepMind alignment team’s strategy” was poorly chosen. Some readers seem to have interpreted the previous title as meaning that this was everything that we had thought about or wanted to say about an “alignment plan”, which is an unfortunate misunderstanding. We simply meant to share slides that gave a high-level outline of how we were thinking about our alignment plan, in the interest of partial communication rather than no communication.
I recently gave a talk about the DeepMind alignment team’s strategy at the SERI MATS seminar, and I’m sharing the slides here for anyone interested. This is an overview of our threat models, our high-level current plan, and how current projects fit into this plan.
Disclaimer: This talk represents the views of the alignment team and is not officially endorsed by DeepMind. This is a work in progress and is not intended to be a detailed or complete plan.
Let’s start with our threat model for alignment—how we expect AGI development to go and the main sources of risk.
Development model. We expect that AGI will likely arise in the form of scaled-up foundation models fine-tuned with RLHF, and that not many more fundamental innovations are needed for AGI (though probably still a few). We also expect that the AGI systems we build will plausibly exhibit the following properties:
Goal-directedness. This means that the system generalizes to behave coherently towards a goal in new situations (though we don’t expect that it would necessarily generalize to all situations or become an expected utility maximizer).
Situational awareness. We expect that at some point an AGI system would develop a coherent understanding of its place in the world, e.g. knowing that it is running on a computer and being trained by human designers.
Risk model. Here is an overall picture from our recent post on Clarifying AI X-risk:
We consider possible technical causes of the risk, which are either specification gaming (SG) or goal misgeneralization (GMG), and the path that leads to existential risk, either through the interaction of multiple systems or through a misaligned power-seeking system.
Various threat models in alignment focus on different parts of this picture. Our particular threat model is focused on how the combination of SG and GMG can lead to misaligned power-seeking, so it is in the highlighted cluster above.
Conditional on AI existential risk happening, here is our most likely scenario for how it would occur (though we are uncertain about how likely this scenario is in absolute terms):
The main source of risk is a mix of specification gaming and goal misgeneralization, with somewhat more of the risk coming from goal misgeneralization.
A misaligned consequentialist arises and seeks power. We expect this would arise mainly during RLHF rather than in the pretrained foundation model, because RLHF tends to make models more goal-directed, and the fine-tuning tasks benefit more from consequentialist planning than pretraining does (see the objective sketched after this list).
This is not detected because deceptive alignment occurs (as a consequence of power-seeking), and because interpretability is hard.
Relevant decision-makers may not understand in time that this is happening, if there is an inadequate societal response to warning shots for model properties like consequentialist planning, situational awareness and deceptive alignment.
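For reference, a standard form of the RLHF objective (my notation, not from the talk) makes the explicit optimization target visible: the fine-tuned policy \pi_\theta is trained to maximize a learned reward model r_\phi while staying close to the pretrained policy \pi_{\mathrm{ref}}:

J(\theta) = \mathbb{E}_{x \sim D,\ y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\ \|\ \pi_{\mathrm{ref}}(\cdot \mid x)\big)

Optimizing a single scalar reward across many tasks is the kind of training pressure that plausibly selects for coherent, goal-directed behavior.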
We can connect this threat model to our views on MIRI’s arguments for AGI ruin.
Some things we agree with: we generally expect that capabilities easily generalize out of desired scope (#8) and possibly further than alignment (#21), inner alignment is a major issue and outer alignment is not enough (#16), and corrigibility is anti-natural (#23).
Some disagreements: we don’t think it’s impossible to cooperate to avoid or slow down AGI (#4), or that a “pivotal act” is necessary (#6), though we agree that it’s necessary to end the acute risk period in some way. We don’t think corrigibility is unsolvable (#24), and we think interpretability is possible though probably very hard (section B3). We expect some tradeoff between powerful and understandable systems (#30) but not a fundamental obstacle.
Note that this is a bit different from the summary of team opinions in our AGI ruin survey. The above summary is from the perspective of our alignment plan, rather than the average person on the team who filled out the survey.
Our approach. Our high-level approach to alignment is to try to direct the training process towards aligned AI and away from misaligned AI. To illustrate this, imagine a space of possible models, where the red areas consist of misaligned models that are highly competent and cause catastrophic harm, and the blue areas consist of aligned models that are highly competent and don’t cause catastrophic harm. The training process moves through this space and by default ends up in a red area of misaligned models. We aim to identify some key point on this path, for example a point where deception was rewarded, and apply some alignment technique that directs the training process to a blue area of aligned models instead.
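As a loose illustration of this framing (a sketch only: the loop structure and all helper names below are hypothetical, not an actual DeepMind training setup), the idea is to interleave ordinary training updates with alignment checks that can trigger an intervention at these key points:

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable

# Illustrative sketch only -- not an actual DeepMind training setup.
# Idea: interleave ordinary training updates with alignment checks
# (evals, interpretability probes, etc.) so the process can be redirected
# at key points instead of drifting into a "red" (misaligned) region.

@dataclass
class CheckReport:
    flagged: bool       # did any check (e.g. a deception probe) fire?
    details: str = ""

def train_with_alignment_checks(
    model: Any,
    batches: Iterable[Any],
    update: Callable[[Any, Any], Any],             # one training step: (model, batch) -> model
    run_checks: Callable[[Any], CheckReport],      # alignment checks on the current model
    intervene: Callable[[Any, CheckReport], Any],  # e.g. roll back, change the data or reward
    check_every: int = 1000,
) -> Any:
    for step, batch in enumerate(batches):
        model = update(model, batch)               # ordinary gradient / RLHF step
        if step % check_every == 0:
            report = run_checks(model)
            if report.flagged:
                model = intervene(model, report)   # redirect towards the "blue" region
    return model
```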
We can break down our high-level approach into work on alignment components, which focuses on building different elements of an aligned system, and alignment enablers, which make it easier to get the alignment components right.
Components: build aligned models
Outer alignment
Scalable oversight (Sparrow, debate)
Process-based feedback (the process/outcome distinction is sketched after this list)
Inner alignment
Mitigating goal misgeneralization
Red-teaming
Enablers: detect models with dangerous properties
Detect misaligned reasoning
Looking at internal reasoning (mechanistic interpretability)
Cross-examination (and consistency checks more generally)
Detect capability transitions
Capability evaluations
Predicting phase transitions (e.g. grokking)
Detect goal-directedness
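To make the process-based feedback item above a bit more concrete, here is a minimal sketch of the distinction (my own illustration, not the implementation from the paper referenced below): outcome-based feedback scores only the final answer, while process-based feedback scores each intermediate step, supervising how the model gets there rather than just whether it gets there.

```python
from typing import Callable, Sequence

# Minimal sketch of the outcome vs process distinction (illustrative only).
# Outcome-based feedback judges just the final answer; process-based feedback
# judges every intermediate reasoning step.

def outcome_reward(
    steps: Sequence[str],                  # the model's reasoning steps; the last one is the answer
    judge_answer: Callable[[str], float],  # human or model judgement of the final answer
) -> float:
    return judge_answer(steps[-1])

def process_reward(
    steps: Sequence[str],
    judge_step: Callable[[str], float],    # human or model judgement of each individual step
) -> float:
    # Rewarding each step leaves less room for reaching the right answer
    # for the wrong reasons.
    return sum(judge_step(s) for s in steps) / max(len(steps), 1)
```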
Teams and projects. Now we’ll briefly review what we are currently working on and how that fits into the plan. The most relevant teams are Scalable Alignment, Alignment, and Strategy & Governance. I would say that Scalable Alignment is mostly working on components and the other two teams are mostly working on enablers. Note that this doesn’t include everyone doing relevant work at DeepMind.
Scalable alignment (led by Geoffrey Irving):
Sparrow
Paper: Improving alignment of dialogue agents via targeted human judgements
Process-based feedback
Paper: Solving math word problems with process- and outcome-based feedback
Red-teaming
Paper: Red Teaming Language Models with Language Models
Alignment (led by Rohin Shah):
Capability evaluations (led by Mary Phuong, in collaboration with other labs)
Mechanistic interpretability (led by Vladimir Mikulik)
Paper: Tracr: Compiled Transformers as a Laboratory for Interpretability
Goal misgeneralization (led by Rohin Shah)
Paper: Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals
Causal alignment (led by Tom Everitt)
Paper: Discovering Agents
Internal outreach (led by Victoria Krakovna)
Strategy & Governance (led by Allan Dafoe):
Capability evaluations
Institutional engagement / internal outreach
(Lots of other things)
Relative to OpenAI’s plan. Our plan is similar to OpenAI’s approach in terms of components—we are also doing scalable oversight based on RLHF. We are less confident in components working by default, and are relying more on enablers such as mechanistic interpretability and capability evaluations.
A major part of OpenAI’s plan is to use large language models and other AI tools for alignment research. This is a less prominent part of our plan, and we mostly count on those tools being produced outside of our alignment teams (either by capabilities teams or external alignment researchers).
General hopes. Our plan is based on some general hopes:
The most harmful outcomes happen when the AI “knows” it is doing something that we don’t want, so mitigations can be targeted at this case.
Our techniques don’t have to stand up to misaligned superintelligences—the hope is that they make a difference while the training process is in the gray area, not after it has reached the red area.
In terms of directing the training process, the game is skewed in our favour: we can restart the search, examine and change the model’s beliefs and goals using interpretability techniques, choose exactly what data the model sees, etc.
Interpretability is hard but not impossible.
We can train against our alignment techniques and get evidence on whether AI systems learn to deceive or evade those techniques. If we get evidence that they are likely to do that, we can use this to create demonstrations of bad behavior for decision-makers.
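As an illustrative sketch of that last point (hypothetical helpers, not an existing experimental pipeline): train against some detection techniques while holding others out, and treat “passes the in-training checks but fails the held-out ones” as evidence that the model is gaming the checks, which can then be turned into a concrete demonstration for decision-makers.

```python
from typing import Any, Callable, Sequence

# Illustrative sketch only (hypothetical helpers, not an existing pipeline).
# Train against *some* alignment checks, hold others out, and compare:
# a model that satisfies the in-training checks but is flagged by held-out
# ones has likely learned to evade the checks rather than to be aligned.

Detector = Callable[[Any], bool]  # returns True if misaligned behavior is detected

def evasion_evidence(
    train_against: Callable[[Sequence[Detector]], Any],  # training loop that penalises detections
    in_loop: Sequence[Detector],
    held_out: Sequence[Detector],
) -> dict:
    model = train_against(in_loop)
    return {
        "passes_in_loop": not any(detect(model) for detect in in_loop),
        "flagged_by_held_out": any(detect(model) for detect in held_out),
        # Both True together is evidence of check-gaming -- usable as a
        # demonstration of bad behavior for decision-makers.
    }
```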
Overall, while alignment is a difficult problem, we think there are some reasons for optimism.
Takeaways. Our main threat model is basically a combination of SG and GMG leading to misaligned power-seeking. Our high-level approach is trying to direct the training process towards aligned AI and away from misaligned AI. There is a lot of alignment work going on at DeepMind, with particularly big bets on scalable oversight, mechanistic interpretability and capability evaluations.