Do you currently have any objections to using “AI alignment” as the broader term (in line with the MIRI/Arbital definition and examples) and “AI motivation” as the narrower term (as suggested by Rohin)?
Yes:
The vast majority of existing usages of “alignment” should then be replaced by “motivation,” which is more specific and usually just as accurate. If you are going to split a term into new terms A and B, and you find that the vast majority of existing usage should be A, then I claim that “A” should be the one that keeps the old word.
The word “alignment” was chosen (originally by Stuart Russell, I think) precisely because it is such a good name for the problem of aligning AI values with human values; it’s a word that correctly evokes what that problem is about. This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals ‘aligned with human interests’ or simply ‘aligned.’”) Everywhere that anyone talks about alignment they use the analogy with “pointing,” and even MIRI folks usually talk about alignment as if it were mostly or entirely about pointing your AI in the right direction.
In contrast, “alignment” doesn’t really make sense as a name for the entire field of problems about making AI good. For the problem of making AI beneficial we already have the even older term “beneficial AI,” which really means exactly that. In explaining why MIRI doesn’t like that term, Rob said:
Some of the main things I want from a term are:
A. It clearly and consistently keeps the focus on system design and engineering, and whatever technical/conceptual groundwork is needed to succeed at such. I want to make it easy for people (if they want to) to just hash out those technical issues, without feeling any pressure to dive into debates about bad actors and inter-group dynamics, or garden-variety machine ethics and moral philosophy, which carry a lot of derail / suck-the-energy-out-of-the-room risk.
[…] [“AI safety” or “beneficial AI”] doesn’t work so well for A—it’s commonly used to include things like misuse risk.
[continuing last point] The proposed usage of “alignment” doesn’t meet this desideratum though; it has exactly the same problem as “beneficial AI,” except that it’s historically associated with this community. In particular, it absolutely includes “garden-variety machine ethics and moral philosophy.” Yes, there is all sorts of stuff that MIRI or I wouldn’t care about that is relevant to “beneficial” AI, but under the proposed definition of alignment it’s also relevant to “aligned” AI. (This statement by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “alignment,” since I take it you explicitly mean to include moral philosophy?)
People have introduced a lot of terms and changed terms frequently. I’ve changed the language on my blog multiple times at other people’s request. This isn’t costless; it really does make things more and more confusing.
I think “AI motivation” is not a good term for this area of study: it (a) suggests it’s about the study of AI motivation rather than engineering AI to be motivated to help humans, (b) is going to be perceived as aggressively anthropomorphizing (even if “alignment” is only slightly better), and (c) is generally less optimized (related to the second point above: “alignment” is quite a good term for this area).
Probably “alignment” / “value alignment” would be a better split of terms than “alignment” vs. “motivation”. “Value alignment” has traditionally been used with the de re reading, but I could clarify that I’m working on de dicto value alignment when more precision is needed (everything I work on is also relevant on the de re reading, so the other interpretation is also accurate and just less precise).
I guess I have an analogous question for you: do you currently have any objections to using “beneficial AI” as the broader term, and “AI alignment” as the narrower term?
This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals ‘aligned with human interests’ or simply ‘aligned.’”)
But that definition seems quite different from your “A is trying to do what H wants it to do.” For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would still be “aligned,” but under MIRI’s definition it wouldn’t be (because it wouldn’t be pursuing beneficial goals).
This statement by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “alignment,” since I take it you explicitly mean to include moral philosophy?
I think that’s right. When I say the MIRI/Arbital definition of “alignment” I’m referring to what they’ve posted publicly, and I believe it does include moral philosophy. Rob’s statement that you quoted seems to be a private one (I don’t recall seeing it before and can’t find it through Google search), but I can certainly see how it muddies the waters from your perspective.
Probably “alignment” / “value alignment” would be a better split of terms than “alignment” vs. “motivation”. “Value alignment” has traditionally been used with the de re reading, but I could clarify that I’m working on de dicto value alignment when more precision is needed
This seems fine to me, if you could give the benefit of the doubt as to when more precision is needed. I’m basically worried about this scenario: You or someone else writes something like “I’m cautiously optimistic about Paul’s work.” The reader recalls seeing you say that you work on “value alignment”. They match that to what they’ve read from MIRI about how aligned AI “reliably pursues beneficial goals”, and end up thinking that problem is easier than you intend to claim, or thinking there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is “de dicto value alignment”, then that removes most of my worry.
I guess I have an analogous question for you: do you currently have any objections to using “beneficial AI” as the broader term, and “AI alignment” as the narrower term?
This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I’d be fine with it if everyone could coordinate to switch to these terms/definitions.
But that definition seems quite different from your “A is trying to do what H wants it to do.” For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would still be “aligned,” but under MIRI’s definition it wouldn’t be (because it wouldn’t be pursuing beneficial goals).
“Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?
I’ve been assuming that “reliably pursues beneficial goals” is weaker than the definition I proposed, but practically equivalent as a research goal.
I’m basically worried about this scenario: You or someone else writes something like “I’m cautiously optimistic about Paul’s work.” The reader recalls seeing you say that you work on “value alignment”. They match that to what they’ve read from MIRI about how aligned AI “reliably pursues beneficial goals”, and end up thinking that problem is easier than you intend to claim, or thinking there is more disagreement between alignment researchers about the difficulty of the broader problem than there actually is. If you could consistently say that the goal of your work is “de dicto value alignment”, then that removes most of my worry.
I think it’s reasonable for me to be more careful about clarifying what any particular research agenda does or does not aim to achieve. I think that in most contexts that is going to require more precision than just saying “AI alignment” regardless of how the term was defined; I normally clarify by saying something like “an AI which is at least trying to help us get what we want.”
This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I’d be fine with it if everyone could coordinate to switch to these terms/definitions.
My guess is that MIRI folks won’t like the “beneficial AI” term because it is too broad a tent. (Which is also my objection to the proposed definition of “AI alignment,” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.”) My sense is that if that were their position, then you would also be unhappy with their proposed usage of “AI alignment,” since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?
(They might also dislike “beneficial AI” because of random contingent facts about how it’s been used in the past, and so might want a different term with the same meaning.)
My own feeling is that using “beneficial AI” to mean “AI that produces good outcomes in the world” is basically just using “beneficial” in accordance with its usual meaning, and this isn’t a case where a special technical term is needed (and indeed it’s weird to have a technical term whose definition is precisely captured by a single—different—word).
“Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goal. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?
I guess both “reliable” and “beneficial” are matters of degree, so “aligned” in the sense of “reliably pursues beneficial goals” is also a matter of degree. “Do what H wants A to do” would be a moderate degree of alignment whereas “Successfully figuring out and satisfying H’s true/normative values” would be a much higher degree of alignment (in that sense of alignment). Meanwhile, in your sense of alignment, they are at best equally aligned, and the latter might actually be less aligned if H has a wrong idea of metaethics or of what his true/normative values are, such that trying to figure out and satisfy those values is not something that H wants A to do.
I think that in most contexts that is going to require more precision than just saying “AI alignment” regardless of how the term was defined; I normally clarify by saying something like “an AI which is at least trying to help us get what we want.”
That seems good too.
My guess is that MIRI folks won’t like the “beneficial AI” term because it is too broad a tent. (Which is also my objection to the proposed definition of “AI alignment,” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.”) My sense is that if that were their position, then you would also be unhappy with their proposed usage of “AI alignment,” since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?
This paragraph greatly confuses me. My understanding is that someone from MIRI (probably Eliezer) wrote the Arbital article defining “AI alignment” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world”, which satisfies my desire to have a broad tent term that makes minimal assumptions about what problems will turn out to be important. I’m fine with calling this “beneficial AI” instead of “AI alignment” if everyone can coordinate on this (but I don’t know how MIRI people feel about this). I don’t understand why you think ‘MIRI folks won’t like the “beneficial AI” term because it is too broad a tent’ given that someone from MIRI gave a very broad definition to “AI alignment”. Do you perhaps think that Arbital article was written by a non-MIRI person?
“Do what H wants A to do” would be a moderate degree of alignment whereas “Successfully figuring out and satisfying H’s true/normative values” would be a much higher degree of alignment (in that sense of alignment).
In what sense is that a more beneficial goal?
- “Successfully do X” seems to be the same goal as X, isn’t it?
- “Figure out H’s true/normative values” is manifestly a subgoal of “satisfy H’s true/normative values.” Why would we care about that except as a subgoal?
So is the difference entirely between “satisfy H’s true/normative values” and “do what H wants”? Do you disagree with one of the previous two bullet points? Is the difference that you think “reliably pursues” implies something about “actually achieves”?
If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
I try to make it clear that I’m using “want” to refer to some hard-to-define idealization rather than some narrow concept, but I can see how “want” might not be a good term for this; I’d be fine using “values” or something along those lines if that would be clearer.
(This is why I wrote:
“What H wants” is even more problematic than “trying.” Clarifying what this expression means, and how to operationalize it in a way that could be used to inform an AI’s behavior, is part of the alignment problem. Without additional clarity on this concept, we will not be able to build an AI that tries to do what H wants it to do.)
If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
Ah, yes that is a big part of what I thought was the difference. (Actually I may have understood at some point that you meant “want” in an idealized sense but then forgot and didn’t re-read the post to pick up that understanding again.)
ETA: I guess another thing that contributed to this confusion is your talk of values evolving over time, and of preferences about how they evolve, which seems to suggest that by “values” you mean something like “current understanding of values” or “interim values” rather than “true/normative values” since it doesn’t seem to make sense to want one’s true/normative values to change over time.
I try to make it clear that I’m using “want” to refer to some hard-to-define idealization rather than some narrow concept, but I can see how “want” might not be a good term for this; I’d be fine using “values” or something along those lines if that would be clearer.
I don’t think “values” is good either. Both “want” and “values” are commonly used words that typically (in everyday usage) mean something like “someone’s current understanding of what they want” or what I called “interim values”. I don’t see how you can expect people not to be frequently confused if you use either of them to mean “true/normative values”. Like the situation with de re / de dicto alignment, I suggest it’s not worth trying to economize on the adjectives here.
Another difference between your definition of alignment and “reliably pursues beneficial goals” is that the latter has “reliably” in it, which suggests more of a de re reading. To use your example, “Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges.” I think most people would say that an A which correctly understands H’s preferences (and gets oranges) is more reliably pursuing beneficial goals.
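To make the contrast concrete, here is a minimal, purely illustrative sketch of the two readings as applied to the apple-buying A (the names and structure below are hypothetical, not taken from either post):

```python
# Purely illustrative sketch of the de re / de dicto readings of
# "A does what H wants", using the apples/oranges example.

H_ACTUAL_PREFERENCE = "oranges"    # what H really prefers
A_BELIEF_ABOUT_H = "apples"        # A's mistaken belief about what H wants
A_GOAL = "whatever H wants"        # the description A is trying to satisfy
A_ACTION = f"buy {A_BELIEF_ABOUT_H}"

def aligned_de_dicto(goal: str) -> bool:
    # De dicto: A counts as aligned if its goal is the description
    # "whatever H wants", even when a mistaken belief sends it after apples.
    return goal == "whatever H wants"

def aligned_de_re(action: str) -> bool:
    # De re: A counts as aligned only if it is pursuing the thing H in fact
    # wants (oranges), which is closer to "reliably pursues beneficial goals".
    return action == f"buy {H_ACTUAL_PREFERENCE}"

print(aligned_de_dicto(A_GOAL))    # True  -- A is still trying to do what H wants
print(aligned_de_re(A_ACTION))     # False -- A is not pursuing what H actually wants
```

On the de dicto reading the apple-buying A stays aligned despite its mistaken belief, while the de re reading (and, on this reading of “reliably,” MIRI’s phrasing) counts the mistake against it.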
Given this, perhaps the easiest way to reduce confusions moving forward is to just use some adjectives to distinguish your use of the words “want”, “values”, or “alignment” from other people’s.
If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
So “wants” means a want more general than an object-level desire (like wanting to buy oranges), and it already takes into account the possibility of H changing his mind about what he wants if H discovers that his wants contradict his normative values?
If that’s right, how is this generalization defined? (E.g., CEV was “what H wants in the limit of infinite intelligence, reasoning time and complete information”.)
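As a purely illustrative way of writing down the shape of that kind of idealization (the notation below is mine, not from either post, and making the right-hand side precise is exactly the open question being asked):

```latex
% Illustrative schema only: "idealized wants" as a limit of increasingly
% well-resourced deliberation, in the spirit of the CEV phrasing quoted above.
% Deliberate(H, t, I) = what H would endorse after reasoning for time t
% with information I.
\[
  \mathrm{Want}^{*}(H) \;=\; \lim_{\substack{t \to \infty \\ I \to \text{complete}}} \mathrm{Deliberate}(H, t, I)
\]
```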
I don’t understand why you think ‘MIRI folks won’t like the “beneficial AI” term because it is too broad a tent’ given that someone from MIRI gave a very broad definition to “AI alignment”. Do you perhaps think that Arbital article was written by a non-MIRI person?
I don’t really know what anyone from MIRI thinks about this issue. It was a guess based on (a) the fact that Rob didn’t like a number of possible alternative terms for “alignment” because they seemed too broad, (b) the fact that virtually every MIRI usage of “alignment” refers to a much narrower class of problems than “beneficial AI” is usually taken to refer to, and (c) the fact that Eliezer generally seems frustrated with people talking about other problems under the heading of “beneficial AI.”
(But (c) might be driven by powerful AI vs. nearer-term concerns / all the other empirical errors Eliezer thinks people are making, (b) isn’t that indicative, and (a) might be driven by other cultural baggage associated with the term / Rob was speaking off the cuff and not attempting to speak formally for MIRI.)
I’d consider it great if we standardized on “beneficial AI” to mean “AI that has good consequences” and “AI alignment” to refer to the narrower problem of aligning AI’s motivation/preferences/goals.