I’ve come around to thinking that reflective stability is probably enough to counteract value drift.
I’m confident that’s wrong. (I also think you overestimate the stability of human values because you’re not accounting for the stabilizing effect of a stable cultural environment.)
Why?
Consider how AutoGPT works. It spawns new processes that handle subtasks. But those subtasks are never perfectly aligned with the original task.
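To make that concrete, here’s a toy sketch. It is not AutoGPT’s actual mechanism — the goal-vector framing, the Gaussian noise model, and all the names here are illustrative assumptions — but it shows the claimed effect: each subtask restates its parent’s goal with a small error, and those errors compound with depth.

```python
import math
import random

def spawn_subtask(task, noise=0.05, rng=random):
    """Derive a subtask: copy the parent's goal vector with a small
    random perturbation -- the restated goal is never exact."""
    return [g + rng.gauss(0.0, noise) for g in task]

def similarity(a, b):
    """Cosine similarity between two goal vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

rng = random.Random(0)        # fixed seed for reproducibility
original = [1.0] * 8          # the top-level task's goal
task = original
for depth in range(10):       # each level spawns a subtask of the last
    task = spawn_subtask(task, rng=rng)

# Small per-step perturbations accumulate: the depth-10 subtask is
# measurably off-target even though each single hand-off was nearly aligned.
print(similarity(original, task))
```

The point of the sketch is only that misalignment at each delegation step is multiplicative in depth, not that any particular noise level is realistic.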
Again, it only works to a limited extent in humans.
It’s exactly solving the alignment problem again. I’d expect AGI to be better at that than humans if it’s overall smarter/more cognitively competent.
That’s not the right way of thinking about it. There isn’t some threshold where you “solve the alignment problem” completely and then all future RSI has zero drift. All you can do is try to improve how well it’s solved under certain circumstances. As the child system gets smarter the problem is different and more difficult. That’s why you get value drift at each step.
See also this post.
I think you’re saying there will be nonzero drift, and that’s a possible problem. I agree. I just don’t think it’s likely to be a disastrous problem.
That post, on “minutes from a human alignment meeting”, is addressing something I think is important but different from drift: value mis-specification, or equivalently, value mis-generalization. That could be a huge problem without drift playing a role, and vice versa; I think they’re pretty separable.
I wasn’t trying to say what you took me to mean. I just meant that when each AI creates a successor, it has to solve the alignment problem again. I don’t think there will be zero drift at any point, just little enough to count as success. If an AGI cares about following instructions from a designated human, it could quite possibly create a successor that also cares about following instructions from that human. That’s potentially good enough alignment to make humans’ lives a lot better and prevent their extinction. Each successor might have slightly different values in other areas from drift, but that would be okay if the largest core motivation stays approximately the same.
So I think the important question is how much drift there will be and how close the value match needs to be. I tried to find all of the existing work/thinking on that question. “But exactly how complex and fragile?” addresses it, but the discussion doesn’t get far and nobody references other work, so I think we just don’t know and need to work that out.
I used the example of following human instructions because that also provides some amount of a basin of attraction for alignment, so that close-enough-is-good-enough. But even without that, I think it’s pretty likely that reflective stability provides enough compensation for drift to essentially work and provide good-enough alignment indefinitely.
But again, it’s a question of how much drift and how much stabilization/alignment each AGI can perform.
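That tradeoff can be put in a toy model. Everything here is an illustrative assumption, not a claim about real AGI: a one-dimensional “value,” Gaussian drift at each hand-off, and a linear pull-back representing reflective stabilization. The sketch shows why any nonzero stabilization keeps the error in a bounded band instead of letting it random-walk away.

```python
import random

def successor(value, target, drift=0.02, stabilization=0.5, rng=random):
    """One hand-off: the child's value picks up random drift, then
    reflective stability pulls it part of the way back toward the
    motivation the parent was trying to preserve."""
    drifted = value + rng.gauss(0.0, drift)
    return target + (1.0 - stabilization) * (drifted - target)

rng = random.Random(1)        # fixed seed for reproducibility
target = 1.0                  # the core motivation to preserve
value = target
worst = 0.0
for step in range(1000):      # a long chain of successors
    value = successor(value, target, rng=rng)
    worst = max(worst, abs(value - target))

# With stabilization > 0 the worst-case deviation stays in a narrow band;
# with stabilization = 0 the value would random-walk arbitrarily far.
print(worst)
```

Under these assumptions the stationary error scales roughly with drift divided by stabilization strength, which is one way to frame the “how much drift vs. how much stabilization” question quantitatively.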
I just don’t think it’s likely to be a disastrous problem.
But again, it’s a question of how much drift and how much stabilization/alignment each AGI can perform.
I’m not sure where exactly your intuition on this is coming from, but you’re wrong here, and I’m afraid it’s not a matter of opinion. But I guess we’ll have to agree to disagree.