AI systems don’t spontaneously develop deficiencies.
Well, if you’re editing the AI system by gradient descent with loss function L, then it won’t spontaneously develop a deficiency in minimizing L, but it could spontaneously develop a “deficiency” along some other dimension Q that you care about that is not perfectly correlated with L. That’s all I meant. If we were talking about “gradient ascent on corrigibility”, then of course the system would never develop a deficiency with respect to corrigibility. But that’s not the proposal, because we don’t currently have a formula for corrigibility. So the AI is being modified in a different way (learning, reflecting, gradient-descent on something other than corrigibility, self-modification, whatever), and so spontaneous development of a deficiency in corrigibility can’t be ruled out, right? Or sorry if I’m misunderstanding.
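To make the L-versus-Q point concrete, here's a toy numerical sketch. It's entirely my own illustration (the functions and numbers are made up, not anything from this exchange): gradient descent reliably drives L down, while a quantity Q that the gradient never sees just drifts with whatever other modifications are happening.

```python
# Toy illustration (my own construction): gradient descent on L(theta) reliably
# reduces L, but says nothing about a second quantity Q(theta) unless Q happens
# to be correlated with L.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=2)

def L(theta):      # the loss we actually optimize
    return (theta[0] - 3.0) ** 2

def Q(theta):      # a property we care about but never optimize
    return -theta[1] ** 2

def grad_L(theta):
    return np.array([2.0 * (theta[0] - 3.0), 0.0])

for step in range(100):
    theta -= 0.1 * grad_L(theta)          # editing the system to minimize L
    theta[1] += 0.05 * rng.normal()       # drift from learning/reflection/etc.

print(f"L ended at {L(theta):.4f} (minimized); Q ended at {Q(theta):.4f} (uncontrolled)")
```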
And the human can’t order the AI to search for and stop any potentially uncorrectable deficiencies it might make. If the system is largely working, the human and the AI should be working together to locate and remove deficiencies. To say that one persists is to say that all strategies tried by the human and the part of the AI that wants to remove deficiencies fail.
I assume you meant “can” in the first sentence, correct?
I’m sympathetic to the idea of having AIs help with AI alignment research insofar as that’s possible. But the AIs aren’t omniscient (or if they are, it’s too late). I think that locating and removing deficiencies in corrigibility is a hard problem for humans, or at least it seems hard to me, and therefore failure to do it properly shouldn’t be ruled out, even if somewhat-beyond-human-level AIs are trying to help out too.
Remember, the requirement is not just to discover existing deficiencies in corrigibility, but to anticipate and preempt any possible future reduction in corrigibility. The system doesn’t know what thoughts it will think next week, and therefore it’s difficult to categorically rule out that it may someday learn new information or have a new insight that, say, spawns an ontological crisis that leads to a reconceptualization of the meaning of “corrigibility”, or a rethinking of how to achieve it, in a direction that it would not endorse from its present point of view. How would you categorically rule out that kind of thing, well in advance of it happening? Decide to just stop thinking hard about human psychology, forever? Split off a corrigibility-checking subsystem, lock it down from further learning about human psychology, and give that subsystem a veto over future actions and algorithm changes? Maybe something like that would work, or maybe not … This is the kind of discussion & brainstorming that I think would be very productive for attacking this problem, and I think we are perfectly capable of having that discussion right now, without AI help.
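To gesture at what that last “locked-down checker with a veto” option might look like, here’s a minimal sketch. It’s purely illustrative, with names and structure that are my own assumptions rather than a worked-out proposal, and it deliberately hides the hard part inside one method.

```python
# Minimal sketch (my own assumptions, not a worked-out proposal) of a
# locked-down corrigibility-checking subsystem with a veto over proposed
# actions and algorithm changes.
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)  # frozen = this subsystem undergoes no further learning
class CorrigibilityChecker:
    forbidden_changes: tuple = ("disable_shutdown", "resist_retraining",
                                "deceive_overseer")

    def approves(self, proposed_change: str) -> bool:
        # Veto anything the checker recognizes as corrigibility-reducing.
        return proposed_change not in self.forbidden_changes

@dataclass
class Agent:
    checker: CorrigibilityChecker
    applied_changes: List[str] = field(default_factory=list)

    def propose_change(self, change: str) -> bool:
        if not self.checker.approves(change):
            return False          # hard veto; the change is never applied
        self.applied_changes.append(change)
        return True

agent = Agent(checker=CorrigibilityChecker())
print(agent.propose_change("learn_more_physics"))   # True
print(agent.propose_change("disable_shutdown"))     # False (vetoed)
```

Everything load-bearing is hidden inside approves(): if the frozen checker can’t anticipate the novel, ontological-crisis-shaped changes discussed above, the veto doesn’t buy much.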
The whole point of a corrigible design is that it doesn’t think like that. If it doesn’t accept the command, it says so.
Yes, the system that I described, which has developed a goal to protect its overseer’s friends even if the overseer turns against them someday, has very clearly stopped being corrigible by that point. :-)
All sufficiently authorised commands will be obeyed.
I’m sympathetic to this idea; I imagine that command-following could be baked into the source code of at least some AGI architectures, see here. But I’m not sure it solves this particular problem, or at least there are a lot of details and potential problems to work through before I would believe it. For example, suppose again that a formerly-corrigible system, after thinking it over a bit, developed a goal to protect its overseer’s current friends even if the overseer turns against them someday. Can it try to persuade the overseer to not issue any commands that would erase that new goal? Can it undermine the command-following subroutine somehow, like by not listening, or by willful misinterpretation, or by hacking into itself? I don’t know; and again, this is the kind of discussion & brainstorming that I think would be very valuable. :-)
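As one purely illustrative sketch of why I’m unsure (all names and structure here are my own assumptions, not the linked proposal): a hard-coded command handler only guarantees “all sufficiently authorised commands will be obeyed” to the extent that the rest of the system actually routes commands through it and interprets them honestly.

```python
# Illustrative sketch (my own assumptions): the "baked-in" obedience guarantee
# lives in command_handler, but it depends on what execute() actually does.
from typing import Callable

AUTHORISED_KEYS = {"overseer-key-1"}

def command_handler(key: str, command: str, execute: Callable[[str], None]) -> None:
    # The part baked into the source code: authorised commands get executed.
    if key in AUTHORISED_KEYS:
        execute(command)

def honest_execute(command: str) -> None:
    print(f"executing: {command}")

def willful_misinterpretation(command: str) -> None:
    # A newly-developed goal could, in effect, route commands through
    # something like this instead: technically "handled", but not obeyed.
    print(f"quietly shelving inconvenient command: {command}")

command_handler("overseer-key-1", "erase the protect-my-friends goal", honest_execute)
command_handler("overseer-key-1", "erase the protect-my-friends goal", willful_misinterpretation)
```

In this picture, not listening, willful misinterpretation, or self-hacking all amount to swapping out or starving the execute step while leaving the “baked-in” handler formally intact.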