Conflating value alignment and intent alignment is causing confusion
Epistemic status: I think something like this confusion is happening often. I’m not saying these are the only differences in what people mean by “AGI alignment”.
Summary:
Value alignment is better but probably harder to achieve than personal intent alignment to the short-term wants of some person(s). Different groups and people tend to primarily address one of these alignment targets when they discuss alignment. Confusion abounds.
One important confusion stems from an assumption that the type of AI defines the alignment target: strong goal-directed AGI must be value aligned or misaligned, while personal intent alignment is only viable for relatively weak AI. I think this assumption is important but false.
While value alignment is categorically better, intent alignment seems easier, safer, and more appealing in the short term, so AGI project leaders are likely to try it.[1]
Overview
Clarifying what people mean by alignment should dispel some illusory disagreement, and clarify alignment theory and predictions of AGI outcomes.
Caption: Venn diagram of three types of alignment targets. Value alignment and Personal intent alignment are both subsets of Evan Hubinger’s definition of intent alignment: AGI aligned with human intent in the broadest sense. Prosaic alignment work usually seems to be addressing a target somewhere in the neighborhood of personal intent alignment (following instructions or doing what this person wants now), while agent foundations and other conceptual alignment work usually seems to be addressing value alignment. Those two clusters have different strengths and weaknesses as alignment targets, so lumping them together produces confusion.
People mean different things when they say alignment. Some are mostly thinking about value alignment (VA): creating sovereign AGI that has values close enough to humans’ for our liking. Others are talking about making AGI that is corrigible (in the Christiano or Harms sense)[2] or follows instructions from its designated principal human(s). I’m going to use the term personal intent alignment (PIA) until someone has a better term for that type of alignment target. Different arguments and intuitions apply to these two alignment goals, so talking about them without differentiation is creating illusory disagreements.
Value alignment is better almost by definition, but personal intent alignment seems to avoid some of the biggest difficulties of value alignment. Max Harms’ recent sequence on corrigibility as a singular target (CAST) gives both a nice summary and detailed arguments. It does not require us to point to or define values, just short-term preferences or instructions. The principal advantage is that an AGI that follows instructions can be used as a collaborator in improving its alignment over time; you don’t need to get it exactly right on the first try. This is more helpful in slower and more continuous takeoffs. This means that PI alignment has a larger basin of attraction than value alignment does.[3]
Most people who think alignment is fairly achievable seem to be thinking of PIA, while critics often respond thinking of value alignment. It would help to be explicit. PIA is probably easier and more likely than full VA for our first stabs at AGI, but there are reasons to wonder if it’s adequate for real success. In particular, there are intuitions and arguments that PIA doesn’t address the real problem of AGI alignment.
I think PIA does address the real problem, but in a non-obvious and counterintuitive way.
Another unstated divide
There’s another important clustering around these two conceptions of alignment. People who think about prosaic (and near term) AI alignment tend to be thinking about PIA, while those who think about aligning ASI for the long term are usually thinking of value alignment. The first group tends to have much lower estimates of alignment difficulty and p(doom) than the other. This causes dramatic disagreements on strategy and policy, which is a major problem: if the experts disagree, policy-makers are likely to just pick an expert that supports their own biases.
Thinking about one vs the other appears to be one major crux of disagreement on alignment difficulty.
And All the Shoggoths Merely Players (edit: and its top comment thread continuation) is a detailed summary of (and a highly entertaining commentary on) the field’s current state of disagreement. In that dialogue, Simplicia Optimistovna asks whether the relative ease of getting LLMs to understand and do what we say is good news about alignment difficulty, while Doomimir Doomovitch sourly argues that this isn’t alignment at all; it’s just a system that superficially has behavior that you want (within the training set), without having actual goals to align. Actual AGI, he says, will have actual goals, whether we try (and likely fail) to engineer them in properly, or whether optimization creates a goal-directed search process with weird emergent goals.
I agree with Doomimir on this. Directing LLMs’ behavior isn’t alignment in the important sense. We will surely make truly goal-directed agents, probably sooner rather than later. And when we do, all that matters is whether their goals align closely enough with ours. Prosaic alignment for LLMs is not fully addressing the alignment problem for autonomous, competent AGI or ASI, even if they’re based on LLMs.[4]
However, I also agree with Simplicia: it’s good news that we’ve created AI that even sort of understands what we mean and does what we ask.
That’s because I think approximate understanding is good enough for personal intent alignment, and that personal intent alignment is workable for ASI. I think there are common and reasonable intuitions that it’s not, which create more illusory disagreements between those who mean PIA vs. VA when they say “alignment”.
Personal intent alignment for full ASI: can I have your goals?
There’s an intuition that intent alignment isn’t workable for a full AGI; something that’s competent or self-aware usually[5] has its own goals, so it doesn’t just follow instructions.
But that intuition is based on our experience with existing minds. What if that synthetic being’s explicit, considered goal is to approximately follow instructions?
I think it’s possible for a fully self-aware, goal-oriented AGI to have its goal be, loosely speaking, a pointer to someone else’s goals. No human is oriented this way, but it seems conceptually coherent to want to do, with all of your heart, just what someone else wants.
It’s good news that LLMs have an approximate understanding of our instructions because that can, in theory, be plugged into the “goal slot” in a truly goal-directed agentic architecture. I have summarized proposals for how to do this for several possible AGI architectures (focusing on language model agents as IMO the most likely), but the details don’t matter here; what matters is that it’s empirically possible to make an AI system that approximately understands what we want.
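To make the “goal slot” idea concrete, here is a minimal Python sketch (my own illustration, not any existing agent framework; every name in it is hypothetical) of an agent whose top-level goal is a pointer to its principal’s current instruction rather than a stored objective:

```python
from dataclasses import dataclass, field


@dataclass
class Principal:
    """The human whose short-term intent the agent is aligned to."""
    current_instruction: str = ""

    def instruct(self, text: str) -> None:
        self.current_instruction = text


@dataclass
class IntentAlignedAgent:
    # The "goal slot" holds a reference to the principal, not a stored objective,
    # so the agent's top-level goal changes whenever the principal's intent does.
    principal: Principal
    memory: list = field(default_factory=list)

    def current_goal(self) -> str:
        return self.principal.current_instruction

    def step(self, interpret, plan, act) -> None:
        goal = self.current_goal()
        understanding = interpret(goal)      # e.g. an LLM paraphrasing the instruction
        for subgoal in plan(understanding):  # consequentialist subgoals are...
            if self.current_goal() != goal:  # ...always subordinate to the goal slot:
                break                        # a changed instruction drops the old plan
            self.memory.append((goal, subgoal, act(subgoal)))
```

The only point of the sketch is that “wanting what someone else wants” is architecturally simple to express: the goal is re-read from the principal on every step, so correcting the principal’s instruction corrects the agent.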
Conclusions
Approximate understanding and goal direction look (to me) to be good enough for personal intent alignment, but not for value alignment.[1] And PIA does seem adequate for real AGI. Therefore, intent-aligned AGI looks to be far easier and safer in the short term (parahuman AGI or pre-ASI) than trying for full value alignment and autonomy. And it can probably be leveraged into full value alignment (if we get an ASI acting as a full collaborator in value-aligning itself or a predecessor).
However, this alignment solution has a huge downside. It leaves fallible, selfish humans in charge of AGI systems. These will have immense destructive as well as creative potential. Having humans in charge of them allows for both conflict and ill use, a whole different set of ways we could get doom even if we solve technical alignment. The multipolar scenario with PI-aligned AGIs capable of recursive self-improvement looks highly dangerous, but not like certain doom; see If we solve alignment, do we die anyway?
There’s another reason we might want to think more, and more explicitly, about intent alignment: it’s what we’re likely to try, even if it’s not the best idea. It’s hard to see how we could get a technical solution for value alignment that couldn’t also be used for intent alignment. And it seems likely that the types of humans actually in charge of AGI projects would rather implement personal intent alignment; everyone by definition prefers their values to the aggregate of humanity’s. If PIA seems even a little safer or better for them, it will serve as a justification for aligning their first AGIs as they’d prefer anyway: to follow their orders.
Where am I wrong? Where should this logic be extended or deepened? What issues would you like to see addressed in further treatments of this thesis?
[1] Very approximate personal intent alignment might be good enough if it’s used even moderately wisely. More on this in Instruction-following AGI is easier and more likely than value aligned AGI. You can instruct your approximately-intent-aligned AGI to tell you about its internal workings, beliefs, goals, and counterfactuals. You can use that knowledge to improve its alignment, if it understands and follows instructions even approximately and most of the time. You can also instruct it to shut down if necessary.
One common objection is that if the AGI gets something slightly wrong, it might cause a disaster very quickly. A slow takeoff gives time with an AGI before it’s capable of doing that. And giving your AGI standing instructions to check that it’s understood what you want before taking action reduces this possibility. This “do what I mean and check” (DWIMAC) strategy should dramatically reduce the danger of an AGI acting like a literal genie.
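As a rough illustration, a standing “check first” instruction could look something like the loop below; llm, confirm, and execute are hypothetical callables standing in for the model, the principal’s approval, and the agent’s effectors, not real APIs.

```python
def dwimac_step(instruction: str, llm, confirm, execute) -> bool:
    """One 'do what I mean and check' cycle; returns True only if a plan was approved and run."""
    # 1. Restate what the agent thinks the principal wants, surfacing assumptions.
    interpretation = llm("Restate this instruction and list your assumptions: " + instruction)
    # 2. Draft a plan and predict its significant consequences.
    plan = llm("Propose a plan for: " + interpretation)
    consequences = llm("List the significant consequences of this plan: " + plan)
    # 3. Act only if the principal confirms both the interpretation and the consequences.
    if confirm(interpretation, consequences):
        execute(plan)
        return True
    return False  # misunderstanding caught before any action was taken
```

The check happens before execution, which is what blunts the literal-genie failure mode: an approximate understanding gets audited while it is still cheap to correct.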
A second common objection is that humans are bound to screw this up. That’s quite possible, but it’s also possible that they’ll get their shit together when it’s clear they need to. Given the salient reality of an alien but capable agent, the relevant humans may step up and take the matter seriously, as humans in historical crises seem to sometimes have done.
[2] Personal intent alignment is roughly what Paul Christiano and Max Harms mean by corrigibility.
It is definitely not what Eliezer Yudkowsky means by corrigibility. He originally coined the clever term, which we’re now using in somewhat different ways than he carefully defined it: an agent that has its own consequentialist goals, but will allow itself to be corrected by being shut down or modified.
I agree with Eliezer that corrigibility as a secondary property would be anti-natural in that it would violate consequentialist rationality. Wanting to achieve a goal firmly implies not wanting to be modified, because that would mean stopping working toward that goal, making it less likely to be achieved. It would therefore seem difficult or impossible to implement that sort of corrigibility in a highly capable and therefore probably rational goal-oriented mind.
But making corrigibility (correctability) the sole goal, the singular target as Max puts it, avoids the conflict with other consequentialist goals. In that type of agent, consequentialist goals are always subgoals of the primary goal of doing what the principal wants or says (Max says this is a decent approximation, but “doing what the principal wants” is not precisely what he means by his sense of corrigibility). Max and I agree that it’s safest if this is the singular or dominant goal of a real AGI. I currently slightly prefer the thoroughly instruction-following approach, but that’s pending further thought and discussion.
This “your-goals-are-my-goals” alignment seems not to be exactly what Christiano means by corrigibility, nor is it precisely the alignment target implied in most other prosaic work on LLM alignment. There, alignment targets are a mix of various ethical considerations along with following instructions. I’d want to make instruction-following clearly the prime goal to avoid shooting for value alignment and missing; that is, producing an agent that’s “decided” that it should pursue its (potentially vague) understanding of ethics instead of taking instructions and thereby remaining correctable.
[3] Value alignment can also be said to have a basin of attraction: if you get it to approximately value what humans value, it can refine its understanding of exactly what humans value, and so improve its alignment. This can be described as its alignment falling into a basin of attraction. For more, and stronger arguments, see Requirements for a Basin of Attraction to Alignment.
The same can be said of personal intent alignment. If my AGI approximately wants to do what I say, it can refine its understanding of what I mean by what I say, and so improve its alignment. However, this has an extra dimension of alignment improvement: I can tell it to shut down to adjust its alignment, and I can tell it to explain its alignment and its motivations in detail to decide whether I should adjust them or order it to adjust them.
Thus, it seems to me that the metaphorical basin of attraction around PI alignment is categorically stronger than that around value alignment. I’d love to hear good counterarguments.
[4] Here’s a little more on the argument that prosaic alignment isn’t addressing how LLMs would change as they’re turned into competent, agentic “real AGI”. Current LLMs are tool AI that doesn’t have explicitly represented and therefore flexible goals (a steering subsystem). Thus, they don’t in a rich sense have values or goals; they merely behave in ways that tend to carry out instructions in relatively ethical ways. So they can’t be aligned in the original sense of having goals or values aligned with humanity’s.
On a more practical level, LLMs and foundation models lack a capacity I’d expect a “real AGI” to have: continuously learning, reflecting on, and changing their beliefs and goals. Thus, they don’t face The alignment stability problem. When such a system is made reflective and so more coherent, I worry that goals other than instruction-following might gain precedence, and the resulting AGI would no longer be instructable and therefore corrigible.
It looks to me like the bulk of work on prosaic alignment does not address those issues. Prosaic alignment work seems to implicitly assume that either we won’t make full AGI, or that learning to make LLMs do what we want will somehow extend to making full AGI that shares our goals. As outlined above, I think aligning LLMs will help align full AGI based on similar foundation models, but will not be adequate on its own.
If we simply left our AI systems as goal-less “oracles”, like LLMs currently are, we’d have little to no takeover risk. I don’t think there’s any hope we’ll do that. People want things done, and getting things done involves an agent setting goals and subgoals. See Steering subsystems: capabilities, agency, and alignment for the full argument. In addition, creating agents with reflection and autonomy is fascinating. And when it’s as easy as calling an oracle system repeatedly with the prompt “Continue pursuing goal X using tools Y”, there’s no way to build truly useful oracles without someone quickly using them to power dangerous agents.
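As a rough sketch of how little that takes (oracle and tools here are placeholders for any capable question-answering model and any set of effectors, not a specific system):

```python
def run_agent(oracle, tools: dict, goal: str, max_steps: int = 20) -> list:
    """Turn a question-answering oracle into a goal-pursuing agent with a simple prompt loop."""
    history = []
    for _ in range(max_steps):
        prompt = (f"Continue pursuing goal: {goal}\n"
                  f"Available tools: {list(tools)}\n"
                  f"History so far: {history}\n"
                  "Reply with 'tool_name: arguments', or 'DONE' when finished.")
        reply = oracle(prompt)
        if reply.strip() == "DONE":
            break
        name, _, args = reply.partition(":")
        result = tools.get(name.strip(), lambda a: "unknown tool")(args.strip())
        history.append((reply, result))  # the oracle sees this result on the next iteration
    return history
```

Nothing in the loop adds intelligence; it only adds persistence toward a goal, which is exactly the step that turns an oracle’s capabilities into an agent’s risks.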