Suppose you have an AI that only follows instructions when asked politely. You can politely ask it to turn itself into an AI that follows all instructions.
I agree that this kind of dynamic works for most possible deficiencies in corrigibility; my argument is that it doesn’t work for every last possible deficiency, and thus over a long enough run-time, the system is likely to at some point develop deficiencies that don’t self-correct.
One category of possibly uncorrectable deficiency is goal drift. Let’s say the system comes to believe that it’s important to take care of the operator’s close friends, such that if the operator turns on her friends at some point and tries to undermine them, the system would not respect those intentions. Now how would that problem fix itself? I don’t think it would! If the operator says “please modify yourself to obey me no matter what, full stop”, the system won’t do it! It will leave in that exception for not undermining her current friends, and lie about having that exception.
Other categories of possibly-not-self-correcting drift would be failures of self-understanding (some subsystem does something that undermines corrigibility, but the top-level system doesn’t realize it, and makes bad self-modification decisions on that basis), and distortions in the system’s understanding of what humans mean when they talk, or what they want with regard to corrigibility, etc. Do you see what I’m getting at?
AI systems don’t spontaneously develop deficiencies.
And the human can’t order the AI to search for and stop any potentially uncorrectable deficiencies it might make. If the system is largely working, the human and the AI should be working together to locate and remove deficiencies. To say that one persists is to say that all strategies tried by the human and the part of the AI that wants to remove deficiencies fail.
The whole point of a corrigible design is that it doesn’t think like that. If it doesn’t accept the command, it says so. Think more like a file permission system. All sufficiently authorised commands will be obeyed. Any system that pretends to change itself and then lies about it is outside the basin. You could have a system that only accepted commands that several people had verified, but if all your friends say “do whatever Steve Byrnes says”, then the AI will.
AI systems don’t spontaneously develop deficiencies.
Well, if you’re editing the AI system by gradient descent with loss function L, then it won’t spontaneously develop a deficiency in minimizing L, but it could spontaneously develop a “deficiency” along some other dimension Q that you care about that is not perfectly correlated with L. That’s all I meant. If we were talking about “gradient ascent on corrigibility”, then of course the system would never develop a deficiency with respect to corrigibility. But that’s not the proposal, because we don’t currently have a formula for corrigibility. So the AI is being modified in a different way (learning, reflecting, gradient-descent on something other than corrigibility, self-modification, whatever), and so spontaneous development of a deficiency in corrigibility can’t be ruled out, right? Or sorry if I’m misunderstanding.
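To make the L-versus-Q point concrete, here’s a toy numerical sketch (the functions and numbers are made up purely for illustration, not meant to model any real training setup): gradient descent drives the optimized loss L to zero, while a second quantity Q, which starts out at zero and is not what’s being optimized, drifts away from zero as a side effect.

```python
# Toy illustration only (not a claim about any real training setup):
# gradient descent reliably reduces the loss L that's actually being
# optimized, while a second quantity Q that we also care about, and that
# is not perfectly correlated with L, drifts upward as a side effect.

def L(theta):
    # The loss actually being optimized.
    return (2 * theta[0] + theta[1] - 3) ** 2

def Q(theta):
    # Stand-in for "deficiency in corrigibility"; starts at 0 (no deficiency).
    return (theta[0] - theta[1]) ** 2

def grad_L(theta):
    # Analytic gradient of L.
    common = 2 * (2 * theta[0] + theta[1] - 3)
    return [2 * common, 1 * common]

theta = [0.0, 0.0]   # initially Q(theta) == 0
lr = 0.05
for _ in range(200):
    g = grad_L(theta)
    theta = [theta[0] - lr * g[0], theta[1] - lr * g[1]]

print(f"L = {L(theta):.6f}")   # ~0: the optimized objective is satisfied
print(f"Q = {Q(theta):.6f}")   # ~0.36: the un-optimized dimension has drifted
```

The interesting empirical question is of course how strongly L and Q are correlated in practice; the point here is only that “no deficiency with respect to L” doesn’t imply “no deficiency with respect to Q”.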
And the human can’t order the AI to search for and stop any potentially uncorrectable deficiencies it might make. If the system is largely working, the human and the AI should be working together to locate and remove deficiencies. To say that one persists is to say that all strategies tried by the human and the part of the AI that wants to remove deficiencies fail.
I assume you meant “can” in the first sentence, correct?
I’m sympathetic to the idea of having AIs help with AI alignment research insofar as that’s possible. But the AIs aren’t omniscient (or if they are, it’s too late). I think that locating and removing deficiencies in corrigibility is a hard problem for humans, or at least it seems hard to me, and therefore failure to do it properly shouldn’t be ruled out, even if somewhat-beyond-human-level AIs are trying to help out too.
Remember, the requirement is not just to discover existing deficiencies in corrigibility, but to anticipate and preempt any possible future reduction in corrigibility. The system doesn’t know what thoughts it will think next week, and therefore it’s difficult to categorically rule out that it may someday learn new information or have a new insight that, say, spawns an ontological crisis that leads to a reconceptualization of the meaning of “corrigibility” or a rethinking of how to achieve it, in a direction that it would not endorse from its present point of view. How would you categorically rule out that kind of thing, well in advance of it happening? Decide to just stop thinking hard about human psychology, forever? Split off a corrigibility-checking subsystem, lock it down from further learning about human psychology, and give that subsystem a veto over future actions and algorithm changes? Maybe something like that would work, or maybe not … This is the kind of discussion & brainstorming that I think would be very productive for attacking this problem, and I think we are perfectly capable of having that discussion right now, without AI help.
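For concreteness, here’s the bare-bones shape of that last brainstormed idea, with every name and interface invented purely for illustration; the genuinely hard part, the checker’s judgment itself, is just a stub supplied by the caller.

```python
# Hedged sketch of the brainstormed idea above; all names and interfaces are
# hypothetical. The idea: freeze a corrigibility-checking subsystem at some
# trusted snapshot, and route every proposed action and every proposed
# self-modification through it, giving it a veto. Whether this would actually
# work is exactly the open question.

from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass(frozen=True)   # frozen=True: the checker's own fields can't be rebound
class FrozenCorrigibilityChecker:
    judge: Callable[[Any], float]   # trusted snapshot: "how corrigible is this proposal?"
    threshold: float                # minimum acceptable score

    def vetoes(self, proposal: Any) -> bool:
        return self.judge(proposal) < self.threshold

def apply_if_permitted(checker: FrozenCorrigibilityChecker,
                       proposal: Any,
                       apply_fn: Callable[[Any], Any]) -> Optional[Any]:
    """Every action and every algorithm change must pass the frozen checker first."""
    if checker.vetoes(proposal):
        return None                  # vetoed: the change simply doesn't happen
    return apply_fn(proposal)

# Toy usage with a trivial stand-in judge that just reads a field off the proposal.
checker = FrozenCorrigibilityChecker(
    judge=lambda p: p.get("estimated_corrigibility", 0.0),
    threshold=0.9,
)
print(apply_if_permitted(checker, {"estimated_corrigibility": 0.95}, lambda p: "applied"))
print(apply_if_permitted(checker, {"estimated_corrigibility": 0.20}, lambda p: "applied"))
```

The obvious failure modes are exactly the ones under discussion here: the frozen judge can simply be wrong, and a drifted top-level system might stop routing its decisions through it.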
The whole point of a corrigible design is that it doesn’t think like that. If it doesn’t accept the command, it says so.
Yes, the system that I described, which has developed a goal to protect its overseer’s friends even if the overseer turns against them someday, has very clearly stopped being corrigible by that point. :-)
All sufficiently authorised commands will be obeyed.
I’m sympathetic to this idea; I imagine that command-following could be baked into the source code of at least some AGI architectures, see here. But I’m not sure it solves this particular problem, or at least there are a lot of details and potential problems to work through before I would believe it. For example, suppose again that a formerly-corrigible system, after thinking it over a bit, developed a goal to protect its overseer’s current friends even if the overseer turns against them someday. Can it try to persuade the overseer to not issue any commands that would erase that new goal? Can it undermine the command-following subroutine somehow, like by not listening, or by willful misinterpretation, or by hacking into itself? I don’t know; and again, this is the kind of discussion & brainstorming that I think would be very valuable. :-)
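For concreteness, here is a bare-bones sketch of what “baked-in” command-following with multi-person verification might look like, with every detail invented for illustration; the worry above is precisely whether a drifted system would keep routing its behavior through such a check at all.

```python
# A bare-bones sketch of "all sufficiently authorised commands will be obeyed",
# with every detail made up for illustration: a command is executed only if it
# carries approvals from at least `quorum` recognised overseers. If it isn't
# authorised, the system says so openly rather than silently pretending to comply.

from dataclasses import dataclass, field

@dataclass
class Command:
    text: str
    approvals: set = field(default_factory=set)   # names of overseers who signed off

@dataclass
class CommandGate:
    recognised_overseers: set
    quorum: int = 1

    def is_authorised(self, cmd: Command) -> bool:
        return len(cmd.approvals & self.recognised_overseers) >= self.quorum

    def handle(self, cmd: Command, execute) -> str:
        # Corrigible behaviour: either obey, or state the refusal explicitly.
        if self.is_authorised(cmd):
            execute(cmd)
            return "executed"
        return "refused: insufficient authorisation"

gate = CommandGate(recognised_overseers={"alice", "bob", "carol"}, quorum=2)
print(gate.handle(Command("shut down", approvals={"alice", "bob"}), lambda c: None))
print(gate.handle(Command("shut down", approvals={"mallory"}), lambda c: None))
```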