Alignment/value drift is definitely something I’m concerned about.
I wrote about it in a paper, Goal changes in intelligent agents, and a post, The alignment stability problem.
But those are more about the problem. I’ve come around to thinking that reflective stability is probably enough to counteract value drift, but it’s not guaranteed by any means.
Value drift will happen; the question is how much. The existing agent will try to give successors the same alignment/goals it has (or preserve its own goals if it’s learning or otherwise self-modifying).
So there are two forces at work: the agent’s own attempt to maintain alignment, and accidental drift away from those values. The question is how much net drift results from the sum of those two forces.
If we’re talking about successors, it’s exactly solving the alignment problem again. I’d expect AGI to be better at that if it’s overall smarter/more cognitively competent than humans. If it’s not, I wouldn’t trust it to solve that problem alone, and I’d want humans involved. That’s why I call my alignment proposal “do what I mean and check” (DWIMAC), a variant of instruction-following; I wouldn’t want a parahuman-level AGI doing important things (like aligning a successor) without consulting closely with its creators before acting.
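To make the “check” part concrete, here’s a rough sketch of the shape of the idea, nothing more: an instruction-following loop that pauses for its principal’s confirmation before any action it judges to be high-impact. The impact heuristic and the action strings below are made-up placeholders, not part of the actual proposal.

```python
# Toy sketch of a "do what I mean and check" (DWIMAC) step: follow the
# instruction, but pause for the principal's confirmation before any
# action judged to be high-impact (like shaping a successor's goals).
# The impact heuristic and the marker strings are placeholders.

HIGH_IMPACT_THRESHOLD = 0.5

def estimate_impact(action: str) -> float:
    """Placeholder heuristic: score how consequential/irreversible an action looks."""
    markers = ("successor", "self-modify", "deploy", "acquire resources")
    return 1.0 if any(m in action.lower() for m in markers) else 0.1

def principal_approves(action: str) -> bool:
    """Placeholder for the 'check': consult the designated human before acting."""
    return input(f"Approve '{action}'? [y/N] ").strip().lower() == "y"

def dwimac_step(instruction: str, proposed_action: str) -> None:
    """Do what the instruction means, but check first if the action is a big one."""
    if estimate_impact(proposed_action) >= HIGH_IMPACT_THRESHOLD:
        if not principal_approves(proposed_action):
            print(f"Deferring: '{proposed_action}' was not approved.")
            return
    print(f"Executing '{proposed_action}' (instruction: '{instruction}').")

dwimac_step("design a successor",
            "give the successor the goal of following my principal's instructions")
```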
Once it’s smarter than human, I’d expect its alignment attempts to be good enough to largely succeed, even though some small amount of drift/imperfection seems inevitable.
If we needed totally precise value alignment for success, that wouldn’t work. But it seems like there’s a range of outcomes we’d find quite good, so the match doesn’t need to be perfect; there’s room for some drift.
So this is a complex issue, but I don’t think it’s likely to be a showstopper. It’s another question that deserves more thought before we launch a real AGI that learns, self-improves, and helps design successors.
I’ve come around to thinking that reflective stability is probably enough to counteract value drift.
I’m confident that’s wrong. (I also think you overestimate the stability of human values because you’re not considering the effect of the stability of the cultural environment.)
Why?
Consider how AutoGPT works. It spawns new processes that handle subtasks. But those subtasks are never perfectly aligned with the original task.
Again, reflective stability only works to a limited extent in humans.
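A toy version of the delegation point (this has nothing to do with AutoGPT’s actual internals; it just treats each hand-off as a slightly lossy restatement of the parent’s goal, with an arbitrary noise level):

```python
# Toy model of drift under repeated delegation: each spawned subtask
# gets a slightly corrupted restatement of its parent's goal, so
# alignment with the original task degrades with depth. This illustrates
# the compounding-error point; it is not AutoGPT's mechanics.
import math
import random

def restate(goal, noise=0.2):
    """A child's version of the goal: the parent's version plus small errors."""
    return [g + random.gauss(0.0, noise) for g in goal]

def similarity(a, b):
    """Cosine similarity between two goal representations."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

random.seed(0)
original_task = [random.gauss(0.0, 1.0) for _ in range(50)]
current = original_task
for depth in range(1, 11):
    current = restate(current)  # each level re-specifies the goal for its child
    print(f"depth {depth:2d}: similarity to original task = {similarity(original_task, current):.3f}")
```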
it’s exactly solving the alignment problem again. I’d expect AGI to be better at that if it’s overall smarter/more cognitively competent than humans
That’s not the right way of thinking about it. There isn’t some threshold where you “solve the alignment problem” completely and then all future RSI has zero drift. All you can do is try to improve how well it’s solved under certain circumstances. As the child system gets smarter, the problem is different and more difficult. That’s why you get value drift at each step.
See also this post.
I think you’re saying there will be nonzero drift, and that’s a possible problem. I agree. I just don’t think it’s likely to be a disastrous problem.
That post, on “minutes from a human alignment meeting”, is addressing something I think of as important but different from drift: value mis-specification, or equivalently, value mis-generalization. That could be a huge problem without drift playing a role, and vice versa; I think they’re pretty separable.
I wasn’t trying to say what you took me to be saying. I just meant that when each AI creates a successor, it has to solve the alignment problem again. I don’t think there will be zero drift at any point, just little enough to count as success. If an AGI cares about following instructions from a designated human, it could quite possibly create a successor that also cares about following instructions from that human. That’s potentially good enough alignment to make humans’ lives a lot better and prevent their extinction. Each successor might have slightly different values in other areas from drift, but that would be okay if the largest core motivation stays approximately the same.
So I think the important questions are how much drift there will be and how close the value match needs to be. I tried to find all of the work/thinking on the question of how close a value match needs to be. The post But exactly how complex and fragile? addresses it, but the discussion doesn’t get far and nobody references other work, so I think we just don’t know and need to work that out.
I used the example of following human instructions because that also provides something of a basin of attraction for alignment, so that close enough is good enough. But even without that, I think it’s pretty likely that reflective stability compensates for enough of the drift to provide good-enough alignment indefinitely.
But again, it’s a question of how much drift and how much stabilization/alignment each AGI can perform.
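The kind of toy model I have in mind for that tradeoff: accidental drift pushes values away from the target each generation, while whatever correction each AGI manages (e.g., the instruction-following basin of attraction) pulls them back. The numbers below are invented for illustration; the point is just that net drift can stay bounded when the correction outweighs the drift, and compounds when it doesn’t.

```python
# Toy model of the two forces: random per-generation drift away from the
# intended values vs. each generation's deliberate re-alignment toward them.
# The drift and correction strengths are invented for illustration only.
import random

def run(generations=30, drift=0.10, correction=0.5, seed=0):
    """Return how far values end up from the target after the given generations."""
    random.seed(seed)
    error = 0.0  # distance of the current values from the intended target
    for _ in range(generations):
        error += abs(random.gauss(0.0, drift))  # accidental drift this generation
        error *= (1.0 - correction)             # deliberate re-alignment toward the target
    return error

print("with active correction: final error =", round(run(correction=0.5), 3))
print("with no correction:     final error =", round(run(correction=0.0), 3))
```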
I just don’t think it’s likely to be a disastrous problem.
But again, it’s a question of how much drift and how much stabilization/alignment each AGI can perform.
I’m not sure where exactly your intuition on this is coming from, but you’re wrong here, and I’m afraid it’s not a matter of my opinion. But I guess we’ll have to agree to disagree.