The problem with that is that “corrigibility” may be a transient feature. As in, you train up a corrigible AI, and it starts out very uncertain about which values it should enforce/how it should engage with the world. You give it feedback, and gradually make it more and more certain about some aspects of its behavior, so it can use its own judgement instead of constantly querying you. Eventually, you lock in some understanding of how it should extrapolate your values, and then the “corrigibility” phase is past and it just goes off to rearrange reality to your preferences.
And my concern, here, is that in the rearranged reality, there may not be any occasion for you to change your mind. Like, say you really hate people from Category A, and tell the AI to make them suffer eternally. Do you then visit their hell to gloat? Probably not: you’re just happy knowing, in the abstract, that they’re suffering. Or maybe you do visit, and see the warped visages of these monsters with no humanity left in them, and think that yeah, that seems just and good.
I think those are perfectly good concerns. But they don’t seem so likely that they make me want to exterminate humanity to avoid them.
I think you’re describing a failure of corrigibility. Which could certainly happen, for the reason you give. But it does seem quite possible (and perhaps likely) that an agentic system will be designed primarily for corrigibility, or alternately, alignment by obedience.
The second seems like a failure of morality. Which could certainly happen. But I see very few people who both enjoy inflicting suffering and would continue to enjoy it even given unlimited time and resources to become happy themselves.