Clarifying “AI Alignment”
When I say an AI A is aligned with an operator H, I mean:
A is trying to do what H wants it to do.
The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.
This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean.
In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and, like a human, it may or may not succeed.
Analogy
Consider a human assistant who is trying their hardest to do what H wants.
I’d say this assistant is aligned with H. If we build an AI that has an analogous relationship to H, then I’d say we’ve solved the alignment problem.
“Aligned” doesn’t mean “perfect”:
They could misunderstand an instruction, or be wrong about what H wants at a particular moment in time.
They may not know everything about the world, and so fail to recognize that an action has a particular bad side effect.
They may not know everything about H’s preferences, and so fail to recognize that a particular side effect is bad.
They may build an unaligned AI (while attempting to build an aligned AI).
I use alignment as a statement about the motives of the assistant, not about their knowledge or ability. Improving their knowledge or ability will make them a better assistant — for example, an assistant who knows everything there is to know about H is less likely to be mistaken about what H wants — but it won’t make them more aligned.
(For very low capabilities it becomes hard to talk about alignment. For example, if the assistant can’t recognize or communicate with H, it may not be meaningful to ask whether they are aligned with H.)
Clarifications
The definition is intended de dicto rather than de re. An aligned A is trying to “do what H wants it to do.” Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges. I’d call this behavior aligned because A is trying to do what H wants, even though the thing it is trying to do (“buy apples”) turns out not to be what H wants: the de re interpretation is false but the de dicto interpretation is true.
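To make the distinction concrete, here is a toy sketch of an agent that is aligned de dicto. All names here (believed_preference, act) are hypothetical and purely illustrative, not a proposal for an agent design: the point is only that the policy is defined in terms of A’s own belief about what H wants, so A is still “trying to do what H wants” even when that belief is wrong.

```python
# Toy illustration of the de dicto reading of "trying to do what H wants".
# Hypothetical names; not any real agent design.

def believed_preference(evidence):
    """A's best guess about what H wants, given A's (possibly wrong) evidence."""
    return "apples" if "likes apples" in evidence else "oranges"

def act(evidence):
    # De dicto: A pursues whatever it currently believes H wants.
    guess = believed_preference(evidence)
    return f"buy {guess}"

# A's evidence is mistaken: H actually prefers oranges.
print(act(evidence="H likes apples"))  # -> "buy apples"
# The action is wrong de re (it isn't what H actually wants),
# but A is aligned de dicto: it is trying to do what H wants.
```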
An aligned AI can make errors, including moral or psychological errors, and fixing those errors isn’t part of my definition of alignment except insofar as it’s part of getting the AI to “try to do what H wants” de dicto. This is a critical difference between my definition and some other common definitions. I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are much less urgent, and (b) are likely to involve totally different techniques from the urgent part of alignment.
An aligned AI would also be trying to do what H wants with respect to clarifying H’s preferences. For example, it should decide whether to ask if H prefers apples or oranges, based on its best guesses about how important the decision is to H, how confident it is in its current guess, how annoying it would be to ask, etc. Of course, it may also make a mistake at the meta level — for example, it may not understand when it is OK to interrupt H, and therefore avoid asking questions that it would have been better to ask.
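As a toy illustration of this meta-level tradeoff, the choice of whether to ask can be framed as weighing the expected cost of acting on a possibly wrong guess against the cost of interrupting H. The numbers and function names below are made-up placeholders, not a real decision procedure:

```python
# Toy expected-value sketch of "should A ask H before acting?".
# All quantities are hypothetical placeholders.

def should_ask(p_guess_correct, stakes, cost_of_asking):
    """Compare the expected loss from acting on the current guess
    with the cost of interrupting H to ask."""
    expected_loss_if_wrong = (1 - p_guess_correct) * stakes
    return expected_loss_if_wrong > cost_of_asking

# Low-stakes grocery choice: just act on the best guess.
print(should_ask(p_guess_correct=0.8, stakes=1.0, cost_of_asking=0.5))    # False
# High-stakes decision with the same uncertainty: worth asking first.
print(should_ask(p_guess_correct=0.8, stakes=100.0, cost_of_asking=0.5))  # True
```

Of course, an agent can also be mistaken about the inputs to this kind of tradeoff, which is exactly the meta-level error described above.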
This definition of “alignment” is extremely imprecise. I expect it to correspond to some more precise concept that cleaves reality at the joints. But that might not become clear, one way or the other, until we’ve made significant progress.
One reason the definition is imprecise is that it’s unclear how to apply the concepts of “intention,” “incentive,” or “motive” to an AI system. One naive approach would be to equate the incentives of an ML system with the objective it was optimized for, but this seems to be a mistake. For example, humans are optimized for reproductive fitness, but it is wrong to say that a human is incentivized to maximize reproductive fitness.
“What H wants” is even more problematic than “trying.” Clarifying what this expression means, and how to operationalize it in a way that could be used to inform an AI’s behavior, is part of the alignment problem. Without additional clarity on this concept, we will not be able to build an AI that tries to do what H wants it to do.
Postscript on terminological history
I originally described this problem as part of “the AI control problem,” following Nick Bostrom’s usage in Superintelligence, and used “the alignment problem” to mean “understanding how to build AI systems that share human preferences/values” (which would include efforts to clarify human preferences/values).
I adopted the new terminology after some people expressed concern with “the control problem.” There is also a slight difference in meaning: the control problem is about coping with the possibility that an AI would have different preferences from its operator. Alignment is a particular approach to that problem, namely avoiding the preference divergence altogether (so excluding techniques like “put the AI in a really secure box so it can’t cause any trouble”). There currently seems to be a tentative consensus in favor of this approach to the control problem.
I don’t have a strong view about whether “alignment” should refer to this problem or to something different. I do think that some term needs to refer to this problem, to separate it from other problems like “understanding what humans want,” “solving philosophy,” etc.
This post was originally published here on 7th April 2018.
The next post in this sequence will be published on Saturday, and will be “An Unaligned Benchmark” by Paul Christiano.
Tomorrow’s AI Alignment Sequences post will be the first in a short new sequence of technical exercises from Scott Garrabrant.