Nice post. I’m open-minded, but wanted to write out what I’ve been doing as a point of comparison & discussion. Here’s my terminology as of this writing:
Green box ≈ “AGI safety”
Purple box ≈ “AGI alignment”
Brown box ≈ “Safe & Beneficial AGI”, or “Avoiding AGI x-risk”, or “getting to an awesome post-AGI utopia”, or things like that.
This has one obviously unintuitive aspect, which I discuss in footnote 2 here:
By this definition of “safety”, if an evil person wants to kill everyone, and uses AGI to do so, that still counts as successful “AGI safety”. I admit that this sounds rather odd, but I believe it follows standard usage from other fields: for example, “nuclear weapons safety” is a thing people talk about, and it notably does NOT include the deliberate, authorized launch of nuclear weapons, despite the fact that such a launch would not be “safe” for anyone, by any stretch of the imagination. Anyway, this is purely a question of definitions and terminology. The problem of people deliberately using AGI towards dangerous ends is a real problem, and I am by no means unconcerned about it. I’m just not talking about it in this particular series. See Post 1, Section 1.2.
I haven’t personally been using the term “AI existential safety”, but using it for the brown box seems pretty reasonable to me.
For the purple box, there’s a use-mention issue, I think? Copying from my footnote 3 here:
Some researchers think that the “correct” design intentions (for an AGI’s motivation) are obvious, and define the word “alignment” accordingly. Three common examples are (1) “I am designing the AGI so that, at any given point in time, it’s trying to do what its human supervisor wants it to be trying to do”—this AGI would be “aligned” to the supervisor’s intentions. (2) “I am designing the AGI so that it shares the values of its human supervisor”—this AGI would be “aligned” to the supervisor. (3) “I am designing the AGI so that it shares the collective values of humanity”—this AGI would be “aligned” to humanity.
I’m avoiding this approach because I think that the “correct” intended AGI motivation is still an open question. For example, maybe it will be possible to build an AGI that really just wants to do a specific, predetermined, narrow task (e.g. design a better solar cell), in a way that doesn’t involve taking over the world etc. Such an AGI would not be “aligned” to anything in particular, except for the original design intention. But I still want to use the term “aligned” when talking about such an AGI.
(I could have also said “intent alignment” for (1), I think.)

I don’t think we should try and come up with a special term for (1). The best term might be “AI engineering”. The only thing it needs to be distinguished from is “AI science”.
I think ML people overwhelmingly identify as doing one of those 2 things, and find it annoying and ridiculous when people in this community act like we are the only ones who care about building systems that work as intended.