Wei Dai comments on Clarifying “AI Alignment”

Wei Dai 24 Nov 2018 21:58 UTC
LW: 2 AF: 1
AF

If you think this risk is very large, presumably there is some positive argument for why it’s so large?

Yeah, I didn’t literally mean that I don’t have any arguments, but rather that we’ve discussed it in the past and it seems like we didn’t get close to resolving our disagreement. I tend to think that Aumann Agreement doesn’t apply to humans, and it’s fine to disagree on these kinds of things. Even if agreement ought to be possible in principle (which again I don’t think is necessarily true for humans), if you think that even from your perspective the value drift/corruption problem is currently overly neglected, then we can come back and revisit this at another time (e.g., when you think there’s too many people working on this problem, which might never actually happen).

it seems like the basic problem is that most people just don’t care about averting drift at all, or have any inclination to be thoughtful about how their own preferences evolve

I don’t understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can’t do anything about it, so 2% is how much you expect we can potentially “save” from value drift/corruption? Or are you taking an anti-realist position and saying something like, if someone doesn’t care about averting drift/corruption, then however their values drift that doesn’t constitute any loss?

The narrower definition doesn’t contain every problem that is high priority, it just contains a single high priority problem, which is better than a really broad basket containing a mix of important and not-that-important stuff.

I don’t understand “better” in what sense. Whatever it is, why wouldn’t it be even better to have two terms, one of which is broadly defined so as to include all the problems that might be urgent but also includes lower priority problems and problems whose priority we’re not sure about, and another one that is defined to be a specific urgent problem. Do you currently have any objections to using “AI alignment” as the broader term (in line with the MIRI/Arbital definition and examples) and “AI motivation” as the narrower term (as suggested by Rohin)?
- paulfchristiano 26 Nov 2018 22:24 UTC
  LW: 6 AF: 4
  AF Parent
  Do you currently have any objections to using “AI alignment” as the broader term (in line with the MIRI/Arbital definition and examples) and “AI motivation” as the narrower term (as suggested by Rohin)?
  
  Yes:
  - The vast majority of existing usages of “alignment” should then be replaced by “motivation,” which is more specific and usually just as accurate. If you are going to split a term into new terms A and B, and you find that the vast majority of existing usage should be A, then I claim that “A” should be the one that keeps the old word.
  - The word “alignment” was chosen (originally be Stuart Russell I think) precisely because it is such a good name for the problem of aligning AI values with human values, it’s a word that correctly evokes what that problem is about. This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.””) Everywhere that anyone talks about alignment they use the analogy with “pointing,” and even MIRI folks usually talk about alignment as if it was mostly or entirely about pointing your AI in the right direction.
  - In contrast, “alignment” doesn’t really make sense as a name for the entire field of problems about making AI good. For the problem of making AI beneficial we already have the even older term “beneficial AI,” which really means exactly that. In explaining why MIRI doesn’t like that term, Rob said
  Some of the main things I want from a term are:
  
  A. It clearly and consistently keeps the focus on system design and engineering, and whatever technical/conceptual groundwork is needed to succeed at such. I want to make it easy for people (if they want to) to just hash out those technical issues, without feeling any pressure to dive into debates about bad actors and inter-group dynamics, or garden-variety machine ethics and moral philosophy, which carry a lot of derail / suck-the-energy-out-of-the-room risk.
  
  […] [“AI safety” or “beneficial AI”] doesn’t work so well for A—it’s commonly used to include things like misuse risk.”
  - [continuing last point] The proposed usage of “alignment” doesn’t meet this desiderata though, it has exactly the same problem as “beneficial AI,” except that it’s historically associated with this community. In particular it absolutely includes “garden-variety machine ethics and moral philosophy.” Yes, there is all sorts of stuff that MIRI or I wouldn’t care about that is relevant to “beneficial” AI, but under the proposed definition of alignment it’s also relevant to “aligned” AI. (This statement by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “alignment,” since I take it you explicitly mean to include moral philosophy?)
  - People have introduced a lot of terms and change terms frequently. I’ve changed the language on my blog multiple times at other people’s request. This isn’t costless, it really does make things more and more confusing.
  - I think “AI motivation” is not a good term for this area of study: it (a) suggests it’s about the study of AI motivation rather than engineering AI to be motivated to help humans, (b) is going to be perceived as aggressively anthropomorphizing (even if “alignment” is only slightly better), (c) is generally less optimized (related to the second point above, “alignment” is quite a good term for this area).
  - Probably “alignment” / “value alignment” would be a better split of terms than “alignment” vs. “motivation”. “Value alignment” has traditionally been used with the de re reading, but I could clarify that I’m working on de dicto value alignment when more precision is needed (everything I work on is also relevant on the de re reading, so the other interpretation is also accurate and just less precise).
  I guess I have an analogous question for you: do you currently have any objections to using “beneficial AI” as the broader term, and “AI alignment” as the narrower term?
  - Wei Dai 27 Nov 2018 0:21 UTC
    LW: 4 AF: 3
    AF Parent
    
    This is also how MIRI originally introduced the term. (I think they introduced it here, where they said “We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.”“)
    
    But that definition seems quite different from your “A is trying to do what H wants it to do.” For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would be still be “aligned” but under MIRI’s definition it wouldn’t be (because it wouldn’t be pursuing beneficial goals).
    
    This statement by Rob also makes me think that you wouldn’t in fact be happy with what he at least means by “alignment,” since I take it you explicitly mean to include moral philosophy?
    
    I think that’s right. When I say MIRI/Arbital definition of “alignment” I’m referring to what’s they’ve posted publicly, and I believe it does include moral philosophy. Rob’s statement that you quoted seems to be a private one (I don’t recall seeing it before and can’t find it through Google search) but I can certainly see how it muddies the waters from your perspective.
    
    Probably “alignment” / “value alignment” would be a better split of terms than “alignment” vs. “motivation”. “Value alignment” has traditionally been used with the de re reading, but I could clarify that I’m working on de dicto value alignment when more precision is needed
    
    This seems fine to me, if you could give the benefit of doubt as to when more precision is needed. I’m basically worried about this scenario: You or someone else writes something like “I’m cautiously optimistic about Paul’s work.” The reader recalls seeing you say that you work on “value alignment”. They match that to what they’ve read from MIRI about how aligned AI “reliably pursues beneficial goals”, and end up thinking that is easier than you’d intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there is actually is. If you could consistently say that the goal of your work is “de dicto value alignment” then that removes most of my worry.
    
    I guess I have an analogous question for you: do you currently have any objections to using “beneficial AI” as the broader term, and “AI alignment” as the narrower term?
    
    This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I’d be fine with it if everyone could coordinate to switch to these terms/definitions.
    - paulfchristiano 27 Nov 2018 1:11 UTC
      LW: 2 AF: 1
      AF Parent
      But that definition seems quite different from your “A is trying to do what H wants it to do.” For example, if H has a wrong understanding of his/her true or normative values and as a result wants A to do something that is actually harmful, then under your definition A would be still be “aligned” but under MIRI’s definition it wouldn’t be (because it wouldn’t be pursuing beneficial goals).
      “Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goals. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?
      I’ve been assuming that “reliably pursues beneficial goals” is weaker than the definition I proposed, but practically equivalent as a research goal.
      I’m basically worried about this scenario: You or someone else writes something like “I’m cautiously optimistic about Paul’s work.” The reader recalls seeing you say that you work on “value alignment”. They match that to what they’ve read from MIRI about how aligned AI “reliably pursues beneficial goals”, and end up thinking that is easier than you’d intend, or think there is more disagreement between alignment researchers about the difficulty of the broader problem than there is actually is. If you could consistently say that the goal of your work is “de dicto value alignment” then that removes most of my worry.
      I think it’s reasonable for me to be more careful about clarifying what any particular line of research agenda does or does not aim to achieve. I think that in most contexts that is going to require more precision than just saying “AI alignment” regardless of how the term was defined, I normally clarify by saying something like “an AI which is at least trying to help us get what we want.”
      This actually seems best to me on the merits of the terms alone (i.e., putting historical usage aside), and I’d be fine with it if everyone could coordinate to switch to these terms/definitions.
      My guess is that MIRI folks won’t like the “beneficial AI” term because it is too broad a tent. (Which is also my objection to the proposed definition of “AI alignment,” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.”) My sense is that if that were their position, then you would also be unhappy with their proposed usage of “AI alignment,” since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?
      (They might also dislike “beneficial AI” because of random contingent facts about how it’s been used in the past, and so might want a different term with the same meaning.)
      My own feeling is that using “beneficial AI” to mean “AI that produces good outcomes in the world” is basically just using “beneficial” in accordance with its usual meaning, and this isn’t a case where a special technical term is needed (and indeed it’s weird to have a technical term whose definition is precisely captured by a single—different—word).
      - Wei Dai 27 Nov 2018 2:54 UTC
        LW: 4 AF: 2
        AF Parent
        
        “Do what H wants me to do” seems to me to be an example of a beneficial goal, so I’d say a system which is trying to do what H wants it to do is pursuing a beneficial goals. It may also be pursuing subgoals which turn out to be harmful, if e.g. it’s wrong about what H wants or has other mistaken empirical beliefs. I don’t think anyone could be advocating the definition “pursues no harmful subgoals,” since that basically requires perfect empirical knowledge (it seems just as hard as never taking a harmful action). Does that seem right to you?
        
        I guess both “reliable” and “beneficial” are matters of degree so “aligned” in the sense of “reliably pursues beneficial goals” is also a matter of degree. “Do what H wants A to do” would be a moderate degree of alignment whereas “Successfully figuring out and satisfying H’s true/normative values” would be a much higher degree of alignment (in that sense of alignment). Meanwhile in your sense of alignment they are at best equally aligned and the latter might actually be less aligned if H has a wrong idea of metaethics or what his true/normative values are and as a result trying to figure out and satisfy those values is not something that H wants A to do.
        
        I think that in most contexts that is going to require more precision than just saying “AI alignment” regardless of how the term was defined, I normally clarify by saying something like “an AI which is at least trying to help us get what we want.”
        
        That seems good too.
        
        My guess is that MIRI folks won’t like the “beneficial AI” term because it is too broad a tent. (Which is also my objection to the proposed definition of “AI alignment,” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.“) My sense is that if that were their position, then you would also be unhappy with their proposed usage of “AI alignment,” since you seem to want a broad tent that makes minimal assumptions about what problems will turn out to be important. Does that seem right?
        
        This paragraph greatly confuses me. My understanding is that someone from MIRI (probably Eliezer) wrote the Arbital article defining “AI alignment” as “overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world”, which satisfies my desire to have a broad tent term that makes minimal assumptions about what problems will turn out to be important. I’m fine with calling this “beneficial AI” instead of “AI alignment” if everyone can coordinate on this (but I don’t know how MIRI people feel about this). I don’t understand why you think ‘MIRI folks won’t like the “beneficial AI” term because it is too broad a tent’ given that someone from MIRI gave a very broad definition to “AI alignment”. Do you perhaps think that Arbital article was written by a non-MIRI person?
        
        paulfchristiano 27 Nov 2018 18:34 UTC
        LW: 3 AF: 1
        AF Parent
        “Do what H wants A to do” would be a moderate degree of alignment whereas “Successfully figuring out and satisfying H’s true/normative values” would be a much higher degree of alignment (in that sense of alignment).
        In what sense is that a more beneficial goal?
        “Successfully do X” seems to be the same goal as X, isn’t it?
        “Figure out H’s true/normative values” is manifestly a subgoal of “satisfy H’s true/normative values.” Why would we care about that except as a subgoal?
        So is the difference entirely between “satisfy H’s true/normative values” and “do what H wants”? Do you disagree with one of the previous two bullet points? Is the difference that you think “reliably pursues” implies something about “actually achieves”?
        If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
        I try to make it clear that I’m using “want” to refer to some hard-to-define idealization rather than some narrow concept, but I can see how “want” might not be a good term for this, I’d be fine using “values” or something along those lines if that would be clearer.
        (This is why I wrote:
        What H wants” is even more problematic than “trying.” Clarifying what this expression means, and how to operationalize it in a way that could be used to inform an AI’s behavior, is part of the alignment problem. Without additional clarity on this concept, we will not be able to build an AI that tries to do what H wants it to do.
        )
        Wei Dai 27 Nov 2018 19:33 UTC
        LW: 7 AF: 3
        AF Parent
        
        If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
        
        Ah, yes that is a big part of what I thought was the difference. (Actually I may have understood at some point that you meant “want” in an idealized sense but then forgot and didn’t re-read the post to pick up that understanding again.)
        
        ETA: I guess another thing that contributed to this confusion is your talk of values evolving over time, and of preferences about how they evolve, which seems to suggest that by “values” you mean something like “current understanding of values” or “interim values” rather than “true/normative values” since it doesn’t seem to make sense to want one’s true/normative values to change over time.
        
        I try to make it clear that I’m using “want” to refer to some hard-to-define idealization rather than some narrow concept, but I can see how “want” might not be a good term for this, I’d be fine using “values” or something along those lines if that would be clearer.
        
        I don’t think “values” is good either. Both “want” and “values” are commonly used words that typically (in everyday usage) mean something like “someone’s current understanding of what they want” or what I called “interim values”. I don’t see how you can expect people not to be frequently confused if you use either of them to mean “true/normative values”. Like the situation with de re / de dicto alignment, I suggest it’s not worth trying to economize on the adjectives here.
        
        Another difference between your definition of alignment and “reliably pursues beneficial goals” is that the latter has “reliably” in it which suggests more of a de re reading. To use your example “Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges.” I think most people would call an A that correctly understands H’s preferences (and gets oranges) more reliably pursuing beneficial goals.
        
        Given this, perhaps the easiest way to reduce confusions moving forward is to just use some adjectives to distinguish your use of the words “want”, “values”, or “alignment” from other people’s.
        
        green_leaf 5 Jun 2019 14:32 UTC
        1 point
        Parent
        If the difference is mostly between “what H wants” and “what H truly/normatively values”, then this is just a communication difficulty. For me adding “truly” or “normatively” to “values” is just emphasis and doesn’t change the meaning.
        So “wants” means a want more general than an object-level desire (like wanting to buy oranges), and it already takes into account the possibility of H changing his mind about what he wants if H discovers that his wants contradict his normative values?
        If that’s right, how is this generalization defined? (E.g. The CEV was “what H wants in the limit of infinite intelligence, reasoning time and complete information”.)
        paulfchristiano 27 Nov 2018 18:43 UTC
        2 points
        Parent
        I don’t understand why you think ‘MIRI folks won’t like the “beneficial AI” term because it is too broad a tent’ given that someone from MIRI gave a very broad definition to “AI alignment”. Do you perhaps think that Arbital article was written by a non-MIRI person?
        I don’t really know what anyone from MIRI thinks about this issue. It was a guess based on (a) the fact that Rob didn’t like a number of possible alternative terms to “alignment” because they seemed to be too broad a definition, (b) the fact that virtually every MIRI usage of “alignment” refers to a much narrower class of problems than “beneficial AI” is usually taken to refer to, (c) the fact that Eliezer generally seems frustrated with people talking about other problems under the heading of “beneficial AI.”
        (But (c) might be driven by powerful AI vs. nearer-term concerns / all the other empirical errors Eliezer thinks people are making, (b) isn’t that indicative, and (a) might be driven by other cultural baggage associated with the term / Rob was speaking off the cuff and not attempting to speak formally for MIRI.)
        I’d consider it great if we standardized on “beneficial AI” to mean “AI that has good consequences” and “AI alignment” to refer to the narrower problem of aligning AI’s motivation/preferences/goals.
- paulfchristiano 26 Nov 2018 22:00 UTC
  LW: 2 AF: 1
  AF Parent
  I don’t understand how this is compatible with only 2% loss from value drift/corruption. Do you perhaps think the actual loss is much bigger, but almost certainly we just can’t do anything about it, so 2% is how much you expect we can potentially “save” from value drift/corruption? Or are you taking an anti-realist position and saying something like, if someone doesn’t care about averting drift/corruption, then however their values drift that doesn’t constitute any loss?
  10x worse was originally my estimate for cost-effectiveness, not for total value at risk.
  People not caring about X prima facie decreases the returns to research on X. But may increase the returns for advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.
  - Wei Dai 27 Nov 2018 0:46 UTC
    LW: 4 AF: 2
    AF Parent
    
    People not caring about X prima facie decreases the returns to research on X. But may increase the returns for advocacy (or acquiring resources/influence, or more creative interventions). That bullet point was really about the returns to research.
    
    It’s not obvious that applies here. If people don’t care strongly about how their values evolve over time, that seemingly gives AIs / AI designers an opening to have greater influence over how people’s values evolve over time, and implies a larger (or at least not obviously smaller) return on research into how to do this properly. Or if people care a bit about protecting their values from manipulation from other AIs but not a lot, it seems really important/valuable to reduce the cost of such protection as much as possible.
    
    As for advocacy, it seems a lot easier (at least for someone in my position) to convince a relatively small number of AI designers to build AIs that want to help their users evolve their values in a positive way (or figuring out what their true or normative values are, or protecting their values against manipulation), than to convince all the potential users to want that themselves.
    - paulfchristiano 28 Nov 2018 2:00 UTC
      LW: 2 AF: 1
      AF Parent
      I agree that:
      If people care less about some aspect of the future, then trying to get influence over that aspect of the future is more attractive (whether by building technology that they accept as a default, or by making an explicit trade, or whatever).
      A better understanding of how to prevent value drift can still be helpful if people care a little bit, and can be particularly useful to the people who care a lot (and there will be fewer people working to develop such understanding if few people care).
      I think that both
      (a) Trying to have influence over aspects of value change that people don’t much care about, and
      (b) better understanding the important processes driving changes in values
      are reasonable things to do to make the future better. (Though some parts of (a) especially are somewhat zero-sum and I think it’s worth being thoughtful about that.)
      (I don’t agree with the sign of the effect described in your comment, but don’t think it’s an important point / may just be a disagreement about what else we are holding equal so it seems good to drop.)
      - Vladimir_Nesov 28 Nov 2018 4:08 UTC
        2 points
        Parent
        
        Trying to have influence over aspects of value change that people don’t much care about … [is] reasonable … to do to make the future better
        
        This could refer to value change in AI controllers, like Hugh in HCH, or alternatively to value change in people living in the AI-managed world. I believe the latter could be good, but the former seems very questionable (here “value” refers to true/normative/idealized preference). So it’s hard for the same people to share the two roles. How do you ensure that value change remains good in the original sense without a reference to preference in the original sense, that hasn’t experienced any value change, a reference that remains in control? And for this discussion, it seems like the values of AI controllers (or AI+controllers) is what’s relevant.
        
        It’s agent tiling for AI+controller agents, any value change in the whole seems to be a mistake. It might be OK to change values of subagents, but the whole shouldn’t show any value drift, only instrumentally useful tradeoffs that sacrifice less important aspects of what’s done for more important aspects, but still from the point of view of unchanged original values (to the extent that they are defined at all).