I think you’re saying there will be nonzero drift, and that’s a possible problem. I agree. I just don’t think it’s likely to be a disastrous problem.
That post, on “minutes from a human alignment meeting”, is addressing something I think of as important but different from drift: value mis-specification, or equivalently, value mis-generalization. That could be a huge problem without drift playing a role, and vice versa; I think they’re pretty separable.
I wasn’t trying to say what you took me to be saying. I just meant that when each AI creates a successor, it has to solve the alignment problem again. I don’t think there will be zero drift at any point, just little enough to count as success. If an AGI cares about following instructions from a designated human, it could quite possibly create a successor that also cares about following instructions from that human. That’s potentially good enough alignment to make humans’ lives a lot better and prevent their extinction. Each successor might have slightly different values in other areas from drift, but that would be okay if the largest core motivation stays approximately the same.
So I think the important question is how much drift there will be, and how close the value match needs to be. I tried to find all of the work/thinking on the question of how close a value match needs to be. “But exactly how complex and fragile?” addresses that, but the discussion doesn’t get far and nobody references other work, so I think we just don’t know and need to work that out.
I used the example of following human instructions because that also provides some amount of a basin of attraction for alignment, so that close enough is good enough. But even without that, I think it’s pretty likely that reflective stability compensates for drift well enough to provide good-enough alignment indefinitely.
But again, it’s a question of how much drift and how much stabilization/alignment each AGI can perform.
I’m not sure where exactly your intuition on this is coming from, but you’re wrong here, and I’m afraid it’s not a matter of my opinion. But I guess we’ll have to agree to disagree.