A corrigible AI is one that cooperates with attempts to modify it to bring it more in line with what its creators/users want it to be. Some people think this is a promising direction for alignment research: if an AI could be guaranteed to be corrigible, then even if it ended up with wild or dangerous goals, we could in principle just modify it to not have those goals, and it wouldn’t try to stop us.
“Alignment win condition,” as far as I know, is a phrase I just made up. I mean something that, regardless of whether it “solves” alignment in a specific technical sense, achieves the underlying goal of alignment research, which is “have artificial intelligence which does things we want and doesn’t do things we don’t want.” A superintelligence that is perfectly aligned with its creator’s goals would be very interesting technically and mathematically, but if its creator wants it to kill everyone, it really isn’t any better than an unaligned superintelligence that kills everyone too.