I think you’re right about these drawbacks of using the term “alignment” so broadly. And I agree that more work and attention should be devoted to specifying how we suppose these concepts relate to each other. In my experience, far too little effort is devoted to placing scientific work within its broader context. We cannot afford to waste effort when working on alignment.
I don’t see a better alternative, nor do you suggest one. My preference in terminology is to simply use more specification, rather than trying to get anyone to change the terminology they use. With that in mind, I’ll list what I see as the most common existing terminology for each of the sub-problems.
P1: Avoiding takeover from emergent optimization in AI agents
Best term in use: AInotkilleveryoneism. I disagree that alignment is commonly misused for this.
I don’t think I’ve heard this termed alignment, outside of the assumption you mention in the Berkeley model: that value alignment (P5) is the only way of avoiding takeover (P1). P1 has been termed “the control problem,” which encompasses value alignment; that term is good. P1 does not fit the intuitive definition of alignment. The deliberately clumsy term “AInotkilleveryoneism” seems good for this, in any context you can get away with it. Your statement seems good otherwise.
P2: Ensuring that AI’s information processing (and/or reasoning) is intelligible to us
Best term in use: interpretability
This is more commonly called interpretability, but I agree that it’s commonly lumped into “alignment work” without carefully examining just how it fits in. But it does legitimately fit into P1 (which shouldn’t be called alignment), as well as (what I think you mean by) P3, P5, and P6, which do fit the intuitive meaning of “alignment.” Thus, it does seem like this deserves the term “alignment work” as well as its more precise term of interpretability. So this seems about right, with the caveat of wanting more specificity. As it happens, I just now published a post on exactly this.
P3: Ensuring AIs are good at solving problems as specified (by user or designer)
Best term in use: None. AI safety?
I think you mean to include ensuring AIs also do not do things their designers don’t want. I suggest changing your description, since that effort is more often called alignment and accused of safety-washing.
This is the biggest offender. The problem is that “alignment” is intuitively appealing. I’d argue that this is completely wrong: you can’t align a system with goals (humans) with a tool without goals (LLMs). A sword or an axe is not aligned with its wielder; such tools certainly lead to more trees cut down and people stabbed, but they do not intend those things, so there’s a type error in saying they are aligned with their users’ goals.
But this is pedantry that will continue to be ignored. I don’t have a good idea for making this terminology clear. The term AGI was at one point used to specify AI with agency and goals, and thus AI that would be alignable with human goals, but it’s been watered down. We need a replacement. And we need a better term for “aligning” AIs that are not at all dangerous in the severe way the “alignment problem” terminology was intended to address, or a different term for doing the important work of aligning agentic, RSI-capable AGI.
P4: Ensuring AI systems enhance, and don’t erode, human agency
What? I’d drop this and just consider it a subset of P6. Maybe this plays a bigger role and gets called alignment more often than I know? Do you have examples?
P5: Ensuring that advanced AI agents learn a human utility function
Best term in use: value alignment OR technical alignment.
I think these deserve their own categories in your terminology, because they only partially overlap: technical alignment could be limited to making AGIs that follow instructions. I have been thinking about this a lot. I agree with your analysis that this is what people will probably do, for economic reasons; but I also think there are powerful practical reasons it is much easier than full value alignment, which will be a valuable excuse to align AGI to follow instructions from its creators. I recently wrote up that logic. This conclusion raises another problem that I think deserves to join the flock of related alignment problems: the societal alignment problem. If some humans have AGIs aligned to their values (likely through their intent/instructions), how can we align society to avoid the resulting disasters from AGI-powered conflict?
P6: Ensuring that AI systems lead to desirable systemic and long term outcomes
Best term in use: I don’t think there is one. Any ideas?
The deliberately clumsy term “AInotkilleveryoneism” seems good for this, in any context you can get away with it.
Hard disagree. The position “AI might kill all humans in the near future” is still quite some inferential distance away from the mainstream, even when presented with a respectable academic veneer.
We do not have weirdness points to spend on deliberately clumsy terms, even on LW. Journalists (when they are not busy doxxing people) can read LW too, and if they read that the worry about AI as an extinction risk is commonly called notkilleveryoneism, they are orders of magnitude less likely to take us seriously. Being taken seriously by the mainstream might be helpful for influencing policy.
We could probably get away with using that term ten pages deep into some glowfic, but anywhere else ‘AI as an extinction risk’ seems much better.
I think you’re right. Unfortunately, I’m not sure “AI as an extinction risk” is much better. It’s still a weird thing to posit, by standard intuitions.