I suppose I’m obligated to reply after I posted what I did, so, looking at “We have promising alignment plans with low taxes”:
The system design of “Learning & Steering” has very high “taxes”. It doesn’t work with fully recursive self-improvement. With linear chains of self-improvement, looking at humans indicates that it works less well the more of it you do.
I’m 0% worried about AutoGPT. RSI requires self-modification, which requires control of (something like) training. In that case, “Internal independent review for language model agent alignment” doesn’t work.
The Natural Abstraction Hypothesis is wrong when comparing different levels of intelligence. Bees and humans both have concepts for distance, but bees don’t have a concept for electrical resistance. So “Just Retarget The Search” doesn’t work for ASI.
Thanks for taking a look! You’re not obligated to pursue this further, although I do really want to get some skeptics to fully understand these proposals to poke holes in them.
I don’t think any of these are foolproof, but neither are they guaranteed or even likely to fail AFAICT from the critical analysis people have offered so far. To your specific points (which I am probably misunderstanding in places, sorry for that):
The system design of “Learning & Steering” has very high “taxes”. It doesn’t work with fully recursive self-improvement. With linear chains of self-improvement, looking at humans indicates that it works less well the more of it you do.
I’m not following, I think? Since all capable RL systems are already actor-critic of some sort, it seems like the alignment tax for this one is very near zero. And it seems like it works with recursive self-improvement the way any primary goal does: it’s reflectively stable. Deciding to change your goal runs completely counter to any decision system that effectively pursues that goal and can anticipate outcomes.
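To make the reflective-stability point a bit more concrete, here’s a minimal toy sketch (the names and the one-line outcome model are made up, not a claim about any real system): an agent that scores every candidate action, including “rewrite my own goal”, with its current value function will reject the rewrite whenever it can predict where the rewrite leads.

```python
# Toy sketch: every action, including a goal rewrite, is scored by the CURRENT goal.
def value_under_current_goal(predicted_goal, predicted_state):
    # The current goal only rates futures highly if they are still steered by it
    # (the toy world state is ignored here).
    return 1.0 if predicted_goal == "current_goal" else 0.0

def predict(action):
    # Crude outcome model: which goal would the future agent be pursuing?
    if action == "rewrite_my_goal":
        return ("different_goal", "whatever_state")
    return ("current_goal", "whatever_state")

def choose_action(actions):
    return max(actions, key=lambda a: value_under_current_goal(*predict(a)))

print(choose_action(["do_useful_work", "rewrite_my_goal"]))  # -> do_useful_work
```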
I’m 0% worried about AutoGPT. RSI requires self-modification, which requires control of (something like) training. In that case, “Internal independent review for language model agent alignment” doesn’t work.
I’m also 0% worried about AutoGPT. I described the systems I am worried about in Capabilities and alignment of LLM cognitive architectures. I agree that they need to do online learning. That could be done with a better version of episodic memory, or weight retraining, or both. As above, if the same decision algorithm is used to decide what new knowledge to incorporate, and the system is smart enough to anticipate failure modes (and accurate prediction of outcomes seems necessary to be dangerous), it will avoid taking actions (including incorporating new knowledge) that change its current alignment/goals.
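As a rough sketch of that gate (placeholder names, and a stand-in for the self-model that does the anticipating): a candidate memory or weight update is treated as just another action, and it only gets committed if the current decision algorithm predicts it leaves the current goal intact.

```python
# Rough sketch: incorporate new knowledge only if the (assumed) self-model
# predicts the update leaves the current goal unchanged. Placeholder names;
# the hard open question is how accurate that prediction can be.
def predicted_goal_after(update):
    # Stand-in for the agent's self-model of the update's effect on its values.
    return update.get("predicted_goal", "current_goal")

def maybe_incorporate(memory, update):
    if predicted_goal_after(update) == "current_goal":
        memory.append(update["content"])   # accepted like any other chosen action
    # otherwise the update is skipped, just as a goal-changing action would be
    return memory

memory = []
maybe_incorporate(memory, {"content": "useful fact"})
maybe_incorporate(memory, {"content": "goal-warping lesson",
                           "predicted_goal": "different_goal"})
print(memory)  # -> ['useful fact']
```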
The Natural Abstraction Hypothesis is wrong when comparing different levels of intelligence. Bees and humans both have concepts for distance, but bees don’t have a concept for electrical resistance. So “Just Retarget The Search” doesn’t work for ASI.
I don’t think the natural abstraction hypothesis has to be true for retargeting the search to work. You just need to identify a representation of goals you like. That’s harder if the system doesn’t use exactly the same abstractions you do, but it’s far from obvious that it’s impossible—or even difficult. Value is complex and fragile, but nobody has argued convincingly for just how complex and fragile.
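For what I mean by retargeting, here’s a deliberately tiny sketch (a brute-force toy planner, not a claim about how real model internals are organized): if you can locate the representation the search is optimizing, the same search machinery will pursue whatever goal you swap in.

```python
# Toy "retarget the search": the planner is generic; only the goal it is handed changes.
from itertools import product

def plan(goal_fn, actions, horizon=2):
    # Brute-force search over action sequences, scored by whatever goal it is given.
    return max(product(actions, repeat=horizon), key=goal_fn)

actions = ["gather", "build", "defect"]
learned_goal = lambda seq: seq.count("defect")   # whatever the system happened to learn
swapped_goal = lambda seq: seq.count("build")    # the goal representation we point it at

print(plan(learned_goal, actions))   # ('defect', 'defect')
print(plan(swapped_goal, actions))   # ('build', 'build') -- same search, new target
```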
I’m evaluating these in the context of a slow takeoff, and using the primary goal as something like “follow this guy’s instructions” (see Instruction-following AGI is easier and more likely than value aligned AGI). This provides some corrigibility and the ability to use the nascent AGI as a collaborator in improving its alignment as it grows more capable, which seems to me like it should help any technical alignment approach pretty dramatically.
I’m not following, I think? Since all capable RL systems are already actor-critic of some sort, it seems like the alignment tax for this one is very near zero. And it seems like it works with recursive self-improvement the way any primary goal does: it’s reflectively stable. Deciding to change your goal runs completely counter to any decision system that effectively pursues that goal and can anticipate outcomes.
Let’s say a parent system is generating a more-capable child system. If the parent could perfectly predict what the child system does, it wouldn’t need to make it in the first place. Your argument here assumes that either
the parent system can perfectly predict how well a child will follow its values
or the parent system won’t have its metrics gamed if it evaluates performance in practice
But humans are evidence that the effectiveness of both predictions and evaluations is limited. And my understanding of RSI indicates that effectiveness will be limited for AI too. So the amount of self-improvement that can be controlled effectively is very limited: one or perhaps two stages. Value drift increases with the amount of RSI and can’t be prevented.
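As a toy illustration of that last claim (the fidelity number is invented, purely to show the arithmetic): if each stage transmits the parent’s values with fidelity f < 1, the fraction preserved after n stages is f^n, which falls off quickly.

```python
# Made-up per-stage value-transmission fidelity, just to show how losses compound.
f = 0.9
for n in (1, 2, 5, 10):
    print(n, round(f ** n, 3))   # 1 0.9, 2 0.81, 5 0.59, 10 0.349
```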
Alignment/value drift is definitely something I’m concerned about.
I wrote about it in a paper, Goal changes in intelligent agents, and a post, The alignment stability problem.
But those are more about the problem. My thinking has come around to the view that reflective stability is probably enough to counteract value drift. But it’s not guaranteed by any means.
Value drift will happen, but the question is how much? The existing agent will try to give successors the same alignment/goals it has (or preserve its own goals if it’s learning or otherwise self-modifying).
So there are two forces at work: an attempt to maintain alignment by the agent itself, and an accidental drift away from those values. The question is how much drift happens in the sum of those two forces.
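A crude way to picture that sum (every number here is invented; this only frames the question, it doesn’t answer it): each step adds a small random drift and then a partial correction back toward the original values, and the question is whether the accumulated error stays small.

```python
# Toy model: per step, random drift plus a partial pull back toward the original values.
import random
random.seed(0)

def simulate(steps=20, drift_scale=0.05, correction=0.5):
    error = 0.0                                # distance from the original values
    for _ in range(steps):
        error += random.gauss(0, drift_scale)  # accidental drift this step
        error *= (1 - correction)              # the agent pulls itself partway back
    return abs(error)

print(simulate(correction=0.0))   # no active stabilization: drift just accumulates
print(simulate(correction=0.5))   # with stabilization: most of each step's drift is cancelled
```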
If we’re talking about successors, it’s exactly solving the alignment problem again. I’d expect AGI to be better at that if it’s overall smarter/more cognitively competent than humans. If it’s not, I wouldn’t trust it to solve that problem alone, and I’d want humans involved. That’s why I call my alignment proposal “do what I mean and check” (DWIMAC), a variant of instruction-following; I wouldn’t want a parahuman-level AGI doing important things (like aligning a successor) without consulting closely with its creators before acting.
Once it’s smarter than human, I’d expect its alignment attempts to be good enough to largely succeed, even though some small amount of drift/imperfections seems inevitable.
If we need a totally precise value alignment for success, that wouldn’t work. But it seems like there are a variety of outcomes we’d find quite good, so the match doesn’t need to be perfect; there’s room for some drift.
So this is a complex issue, but I don’t think it’s likely to be a showstopper. It’s another question that deserves more thought before we launch a real AGI that learns, self-improves, and helps design successors.
My thinking has come around to the view that reflective stability is probably enough to counteract value drift.
I’m confident that’s wrong. (I also think you overestimate the stability of human values because you’re not considering the effect of the stability of the cultural environment.)
Why?
Consider how AutoGPT works. It spawns new processes that handle subtasks. But those subtasks are never perfectly aligned with the original task.
Again, it only works to a limited extent in humans.
it’s exactly solving the alignment problem again. I’d expect AGI to be better at that if it’s overall smarter/more cognitively competent than humans
That’s not the right way of thinking about it. There isn’t some threshold where you “solve the alignment problem” completely and then all future RSI has zero drift. All you can do is try to improve how well it’s solved under certain circumstances. As the child system gets smarter, the problem is different and more difficult. That’s why you get value drift at each step.
See also this post.
I think you’re saying there will be nonzero drift, and that’s a possible problem. I agree. I just don’t think it’s likely to be a disastrous problem.
That post, on “minutes from a human alignment meeting”, is addressing something I think of as important but different from drift: value mis-specification, or equivalently, value mis-generalization. That could be a huge problem without drift playing a role, and vice versa; I think they’re pretty separable.
I wasn’t trying to say what you took me to mean. I just meant that when each AI creates a successor, it has to solve the alignment problem again. I don’t think there will be zero drift at any point, just little enough to count as success. If an AGI cares about following instructions from a designated human, it could quite possibly create a successor that also cares about following instructions from that human. That’s potentially good-enough alignment to make humans’ lives a lot better and prevent their extinction. Each successor might have slightly different values in other areas from drift, but that would be okay if the largest core motivation stays approximately the same.
So I think the important questions are how much drift there will be and how close the value match needs to be. I tried to find all of the work/thinking on the question of how close a value match needs to be. The post “But exactly how complex and fragile?” addresses that, but the discussion doesn’t get far and nobody references other work, so I think we just don’t know and need to work that out.
I used the example of following human instructions because that also provides something of a basin of attraction for alignment, so that close enough is good enough. But even without that, I think it’s pretty likely that reflective stability provides enough compensation for drift to essentially work and provide good-enough alignment indefinitely.
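Here’s a very rough sketch of that basin-of-attraction picture (the radius and rates are invented for illustration only): mismatch below some radius gets pulled back each step, mismatch beyond it compounds, so a close-enough start stays close while a bad one diverges.

```python
# Toy basin of attraction: small value mismatch gets corrected, large mismatch compounds.
def step(error, basin_radius=0.3, drift=0.02, correction=0.6):
    error += drift
    if error < basin_radius:
        error *= (1 - correction)   # inside the basin: close enough gets pulled back
    else:
        error *= 1.2                # outside the basin: errors feed on themselves
    return error

for start in (0.1, 0.5):
    e = start
    for _ in range(30):
        e = step(e)
    print(start, round(e, 3))       # the small initial mismatch stays small; the large one grows
```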
But again, it’s a question of how much drift and how much stabilization/alignment each AGI can perform.
I just don’t think it’s likely to be a disastrous problem.
But again, it’s a question of how much drift and how much stabilization/alignment each AGI can perform.
I’m not sure where exactly your intuition on this is coming from, but you’re wrong here, and I’m afraid it’s not a matter of my opinion. But I guess we’ll have to agree to disagree.