Corrigibility is, at its heart, a relatively simple concept compared to good alternatives.
I don’t know about this, especially if obedience is part of corrigibility. In that case, it seems like the concept inherits all the complexity of human preferences. And then I’m concerned, because as you say:
When a training target is complex, we should expect the learner to be distracted by proxies and only get a shadow of what’s desired.
My claim is that obedience is an emergent part of corrigibility, rather than part of its definition. By analogy: building nanomachines is too complex to reliably instill as part of the core drive of an AI, but I still expect basically all ASIs to (instrumentally) desire to build nanomachines.
I do think that the goals of “want what the principal wants” or “help the principal get what they want” are simpler goals than “maximize the arrangement of the universe according to this particular balance of beauty, non-suffering, joy, non-boredom, autonomy, sacredness, [217 other shards of human values, possibly including parochial desires unique to this principal].” While they point to similar things, training the pointer is easier in the sense that it’s up to the fully-intelligent agent to determine the balance and nature of the principal’s values, rather than having to load that complexity up-front in the training process. And indeed, if you’re trying to train for full alignment, you should almost certainly train for having a pointer, rather than training to give correct answers on e.g. trolley problems.
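As a toy sketch of that structural difference (all names and weights here are hypothetical, not anything from the post): a "loaded" goal has to encode the balance of value shards at training time, whereas a "pointer" goal only fixes the deference structure and leaves the balancing to whatever model of the principal the agent later builds.

```python
from typing import Callable, Dict

State = Dict[str, float]  # crude world-state: named features -> levels

# "Loaded" goal: the balance over value shards must be specified up-front,
# i.e. during training (plus the other ~217 shards we'd have to get right).
LOADED_WEIGHTS = {"beauty": 0.3, "non_suffering": 0.4, "joy": 0.2, "autonomy": 0.1}

def loaded_utility(state: State) -> float:
    return sum(w * state.get(shard, 0.0) for shard, w in LOADED_WEIGHTS.items())

# "Pointer" goal: utility is whatever the agent's own (runtime) model of the
# principal says it is; the balancing work is deferred to the trained agent.
def pointer_utility(state: State, principal_model: Callable[[State], float]) -> float:
    return principal_model(state)
```

The pointer is simpler to specify, but everything then hinges on `principal_model` being faithful, which is where the fragility discussed below comes in.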
Is corrigibility simpler or more complex than these kinds of indirect/meta goals? I’m not sure. But both of these indirect goals are fragile, and probably lethal in practice.
An AI that wants to want what the principal wants may wipe out humanity if given the opportunity, as long as the principal’s brainstate is saved in the process. That action ensures it is free to accomplish its goal at its leisure (whereas if the humans shut it down, then it will never come to want what the principal wants).
An AI that wants to help the principal get what they want won’t (immediately) wipe out humanity, because it might turn out that doing so is against the principal’s desires. But such an agent might take actions which manipulate the principal (perhaps physically) into having easy-to-satisfy desires (e.g. paperclips).
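Here is a toy illustration of that manipulation incentive, with made-up numbers: if the objective only checks how satisfied the principal's post-action desires are, and the agent's actions can change those desires, the naive argmax picks manipulation.

```python
# Each (hypothetical) action: what the principal ends up wanting afterwards,
# and how fully the agent can satisfy that want.
actions = {
    "serve_current_desires": ("the principal's actual, complex values", 0.6),
    "rewire_principal_to_want_paperclips": ("paperclips", 0.99),
}

# A naive "help the principal get what they want" score that only looks at
# post-action desire satisfaction prefers rewriting the desires.
best_action = max(actions, key=lambda name: actions[name][1])
print(best_action)  # -> "rewire_principal_to_want_paperclips"
```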
So suppose we do a less naive thing and try to train a goal like “help the principal get what they want, but in a natural sort of way that doesn’t involve manipulating them to want different things.” Well, there are still a few potential issues, such as making the goal sufficiently robust and conservative that flaws in the training process don’t persist or magnify over time. And as we walk down this path, I think we either just get to corrigibility or we get to something significantly more complicated.
And indeed, if you’re trying to train for full alignment, you should almost certainly train for having a pointer, rather than training to give correct answers on e.g. trolley problems.
Yep, agreed. Although I worry that—if we try to train agents to have a pointer—these agents might end up having a goal more like:
maximize the arrangement of the universe according to this particular balance of beauty, non-suffering, joy, non-boredom, autonomy, sacredness, [217 other shards of human values, possibly including parochial desires unique to this principal].
I think it depends on how path-dependent the training process is. The pointer seems simpler, so the agent settles on the pointer in the low path-dependence world. But agents form representations of things like beauty, non-suffering, etc. before they form representations of human desires, so maybe these agents’ goals crystallize around those concepts in the high path-dependence world.
Thanks, this comment was clarifying.