Note that Arbital defines “AI alignment” as:

The “alignment problem for advanced agents” or “AI alignment” is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.

and “total alignment” as:

An advanced agent can be said to be “totally aligned” when it can assess the exact value of well-described outcomes and hence the exact subjective value of actions, policies, and plans; where value has its overridden meaning of a metasyntactic variable standing in for “whatever we really do or really should value in the world or want from an Artificial Intelligence” (this is the same as “normative” if the speaker believes in normativity).
I think this clearly includes the kinds of problems I’m talking about in this thread. Do you agree? Also supporting my view is the history of “Friendliness” being a term that included the problem of better understanding the user’s values (as in CEV) and then MIRI giving up that term in favor of “alignment” as an apparently exact synonym. See this MIRI post which talks about “full alignment problem for fully autonomous AGI systems” and links to Arbital.
In practice, essentially all of MIRI’s work seems to fit within this narrower definition, so I’m not too concerned at the moment with this practical issue.
I think you may have misunderstood what I meant by “practical issue”. My point was that if you say something like “I think AI alignment is the most urgent problem to work on”, the listener could easily misinterpret you as meaning “alignment” in the MIRI/Arbital sense. And if I say “AI alignment is the most urgent problem to work on” in the MIRI/Arbital sense of alignment, the listener could easily misinterpret me as meaning “alignment” in your sense.
Again my feeling is that MIRI started using alignment in the broader sense first and therefore that definition ought to have priority. If you disagree with this, I could try to do some more historical research to show this. (For example, by figuring out when those Arbital articles were written, which I currently don’t know how to do.)
I think MIRI’s first use of this term was here where they said “We call a smarter-than-human system that reliably pursues beneficial goals ‘aligned with human interests’ or simply ‘aligned’,” which is basically the same as my definition. (Perhaps slightly weaker, since “do what the user wants you to do” is just one beneficial goal.) This talk never defines alignment, but the slide introducing the big picture says “Take-home message: We’re afraid it’s going to be technically difficult to point AIs in an intuitively intended direction”, which also really suggests it’s about trying to point your AI in the right direction.
The actual discussion on that Arbital page strongly suggests that alignment is about pointing an AI in a direction, though I suppose that may merely be an instance of suggestively naming the field “alignment” and then defining it to be “whatever is important” as a way of smuggling in the connotation that pointing your AI in the right direction is the important thing. All of the topics in the “AI alignment” domain (except for mindcrime, which is borderline) fit under the narrower definition, and the researchers listed as alignment researchers are all people working on the narrower problem.
So I think the way this term is used in practice basically matches this narrower definition.
As I mentioned, I was previously happily using the term “AI control.” Rob Bensinger suggested that I stop using that term and instead use AI alignment, proposing a definition of alignment that seemed fine to me.
I don’t think the very broad definition is what almost anyone has in mind when they talk about alignment. It doesn’t seem to be matching up with reality in any particular way, except insofar as it captures the problems that a certain group of people work on. I don’t really see any argument in favor except the historical precedent, which I think is dubious in light of all of the conflicting definitions, the actual usage, and the explicit move to standardize on “alignment” where an alternative definition was proposed.
(In the discussion, the compromise definition suggested was “cope with the fact that the AI is not trying to do what we want it to do, either by aligning incentives or by mitigating the effects of misalignment.”)
The “alignment problem for advanced agents” or “AI alignment” is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.
Is this intended (/ do you understand this) to include things like “make your AI better at predicting the world,” since we expect that agents who can make better predictions will achieve better outcomes?
If this isn’t included, is that because “sufficiently advanced” includes making good predictions? Or because of the empirical view that ability to predict the world isn’t an important input into producing good outcomes? Or something else?
If this definition doesn’t distinguish alignment from capabilities, then that seems like a non-starter to me, which neither is useful nor captures the typical usage.
If this excludes making better predictions because that’s assumed by “sufficiently advanced agent,” then I have all sorts of other questions (does “sufficiently advanced” include all particular empirical knowledge relevant to making the world better? does it include some arbitrary category not explicitly carved out in the definition?)
In general, the alternative broader usage of AI alignment is broad enough to capture lots of problems that would exist whether or not we built AI. That’s not so different from using the term to capture (say) physics problems that would exist whether or not we built AI; both feel bad to me.
Independently of this issue, it seems like “the kinds of problems you are talking about in this thread” need better descriptions whether or not they are part of alignment (since even if they are part of alignment, they will certainly involve totally different techniques/skills/impact evaluations/outcomes/etc.).
The actual discussion on that Arbital page strongly suggests that alignment is about pointing an AI in a direction
But the page includes:
“AI alignment theory” is meant as an overarching term to cover the whole research field associated with this problem, including, e.g., the much-debated attempt to estimate how rapidly an AI might gain in capability once it goes over various particular thresholds.
which seems to be outside of just “pointing an AI in a direction”.
Is this intended (/ do you understand this) to include things like “make your AI better at predicting the world,” since we expect that agents who can make better predictions will achieve better outcomes?
I think so, at least for certain kinds of predictions that seem especially important (i.e., may lead to x-risk if done badly); see this Arbital page, which is under AI Alignment:
Vingean reflection is reasoning about cognitive systems, especially cognitive systems very similar to yourself (including your actual self), under the constraint that you can’t predict the exact future outputs. We need to make predictions about the consequence of operating an agent in an environment via reasoning on some more abstract level, somehow.
If this definition doesn’t distinguish alignment from capabilities, then that seems like a non-starter to me, which neither is useful nor captures the typical usage.
It seems to me that Rohin’s proposal of distinguishing between “motivation” and “capabilities” is a good one, and then we can keep using “alignment” for the set of broader problems that are in line with the MIRI/Arbital definition and examples.
In general, the alternative broader usage of AI alignment is broad enough to capture lots of problems that would exist whether or not we built AI. That’s not so different from using the term to capture (say) physics problems that would exist whether or not we built AI; both feel bad to me.
It seems fine to me to include 1) problems that are greatly exacerbated by AI and 2) problems that aren’t caused by AI but may be best solved/ameliorated by some element of AI design, since these are problems that AI researchers have a responsibility for and/or can potentially contribute to. If a problem isn’t exacerbated by AI and doesn’t seem likely to have a solution within AI design, then I wouldn’t include it.
Independently of this issue, it seems like “the kinds of problems you are talking about in this thread” need better descriptions whether or not they are part of alignment (since even if they are part of alignment, they will certainly involve totally different techniques/skills/impact evaluations/outcomes/etc.).
Sure, agreed.