The goal of alignment is that the AI does not kill everyone
It’s worth pointing out that there was no time when alignment meant “AI doesn’t kill everyone”:
I first encountered the term “alignment” as part of Stuart Russell’s “value alignment” (e.g. here) by which he means something like “aligning the utility function of your AI with the values of the human race.” This is clearly broader than not killing everyone.
As far as I know, MIRI first used the term in this paper, which defined “aligned” as “reliably pursues beneficial goals” (though I think they only defined it for smarter-than-human AI). This is also broader than not killing everyone.
I used to say “AI control” (and blogged at ai-control.com) to mean “getting an AI to try to do what you want it to do.” In 2017 I switched to using AI alignment (and moved to ai-alignment.com) at the suggestion of Rob Bensinger and MIRI, who proposed using “alignment” as a synonym for Bostrom’s “control problem.” The control problem is defined in Superintelligence as the principal-agent problem between a human and the AI system they built, which is broader than not killing everyone. I have tried to offer a more precise definition of how I use the term AI alignment, as meaning “figuring out how to build AI systems that are trying to do what their operators want them to do.”
Eliezer has used AI alignment (later than Russell, AFAIK) to mean the whole area of research relevant to building sufficiently advanced AIs such that “running them produces good outcomes in the real world.” This makes what counts as alignment an incredibly complicated empirical and normative question, and AFAICT practically anything might count. I think this is an absolutely terrible definition. You should define the problem you want to work on, not define alignment as “whatever really actually matters” and then argue that empirically the technical problems you care about are the ones that really actually matter. That’s quite literally an invitation to argue about what the term refers to. I still honestly find it hard to believe that people at MIRI considered this a reasonable way of defining and using the term.
So I would say the thing you are describing as “the goal” and “focus” of alignment is just a special case that you care a lot about. (I also care a lot about this problem! See discussion in AI alignment is distinct from its near-term applications.) This isn’t a case of a term being used in a pure and clear way by one community and then co-opted or linguistically corrupted by another; I think it’s a case of a community being bad at defining and using terms, equivocating about definitions, and smuggling complicated empirical claims into proposed “definitions.” I’ve tried to use the term in a consistent way over the last 6 years.
I think it’s reasonable to use “AI safety” to refer to reducing the risk of negative impacts from AI and “AI existential safety” to refer to reducing the risk of existential catastrophes from AI.
I am sympathetic to the recent amusing proposal of “AI notkilleveryoneism” for the particular area of AI existential safety that’s about reducing the risk that your AI deliberately kills everyone. (Though I find Eliezer’s complaints about linguistic drift very unsympathetic.) I usually just describe the goal in a longer way like “figuring out how to build an AI that won’t deliberately kill us” and then have shorter words for particular technical problems that I believe are relevant to that goal (like alignment, robustness, interpretability...)
(Sorry for a bit of a rant, but I’ve seen a lot of people complaining about this in ways I disagree with. OP happened to be the post that I replied to.)
Bostrom’s definition of the control problem in ‘Superintelligence’ only refers to “harming the project’s interests,” which, as you say, is broader than existential risk. However, the immediate context makes it clear that Bostrom is discussing existential risk. The “harm” referred to does not include things like gender bias.
On reflection, I don’t actually believe that AI Alignment has ever exclusively referred to existential risk from AI. I do believe that talk about “AI Alignment” on LessWrong has usually been primarily about existential risk. I further think that the distinction from “Value Alignment” (and whether that is related to existential risk) has been muddled and debated.
I think the term “The Alignment Problem” is used because this community agrees that one problem (not killing everyone) is far and away more central than the rest (e.g. designing an AI to refuse to tell you how to make drugs).
Apart from the people here from OpenAI/DeepMind/etc, I expect general agreement that the task “Getting GPT to better understand and follow instructions” is not AI Alignment, but AI Capability. Note that I am moving my goalpost from defending the claim “AI Alignment = X-Risk” to defending “Some of the things OpenAI call AI Alignment is not AI Alignment”.
At this point I should repeat my disclaimer that all of this is my impression, and not backed by anything rigorous. Thank you for engaging anyway—I enjoyed your “rant”.
The control problem is initially introduced as “the problem of how to control what the superintelligence would do.” In the chapter you reference it is presented as the principal-agent problem that occurs between a human and the superintelligent AI they build (apparently the whole of that problem).
It would be reasonable to say that there is no control problem for modern AI, because Bostrom’s usage of “the control problem” is exclusively about controlling superintelligence. On this definition either there is no control research today, or it comes back to the implicit and controversial empirical claim about which work is relevant and which is not.
If you are teaching GPT to better understand instructions I would also call that improving its capability (though some people would call it alignment, this is the de dicto vs de re distinction discussed here). If it already understands instructions and you are training it to follow them, I would call that alignment.
I think you can use AI alignment however you want, but this is a lame thing to get angry at labs about and you should expect ongoing confusion.