Corrigibility has various slightly different definitions, but the general rough idea is of an AI that does what we want.
An aligned AI will also do what we want, because what we want is also what it wants: its terminal values are ours.
I’ve always taken “control” to differ from alignment in that it means an AI doing what we want even when that isn’t what it wants, i.e. it has a terminal value of getting rewards, and our values are instrumental to that, if they figure at all.
And I take corrigibility to mean shaping an AI’s values as you go along, and therefore an outcome of control.
Sure, an AI that ignores what you ask and implements some form of CEV or whatever isn’t corrigible. Corrigibility is more about following instructions than about having your utility function.