If the odds of slipping it by governments and militaries are slight, wouldn’t the conclusion be the opposite—we should spread understanding of AGI alignment issues so that those in power have thought about them by the time they appropriate the leading projects?
The original post has been arguing that this leads to a hyperexistential catastrophe, and that if they are going to win the race, it’s better to let them destroy everything.
But you have a different model implied here:
I hope that emphasizing corrigibility might be adequate. That would at least let the one group that has controlled the creation of AGI change their minds down the road.
Can you describe in more detail how you picture this going? I think I can guess, and I have objections to that vision, but I’d prefer if you outline it first.
You are probably guessing correctly. I’m hoping that whoever gets ahold of aligned AGI will also make it corrigible, and that over time they’ll trend toward a similar moral view to that generally held in this community. It doesn’t have to be fast.
To be fair, I’m probably pretty biased against the idea that all we can realistically hope for is extinction. The recent [case against AGI alignment](https://www.lesswrong.com/posts/CtXaFo3hikGMWW4C9/the-case-against-ai-alignment) post was the first time I’d seen arguments that strong in that direction. I haven’t really assimilated them yet.
My take on human nature is that, while humans are often stunningly vicious, they are also often remarkably generous. Further, the viciousness usually happens when they feel materially threatened. Someone in charge of an aligned AGI will not feel very threatened for very long. And generosity will be safer and easier than it usually is.
The problem with that is that “corrigibility” may be a transient feature. As in: you train up a corrigible AI, and it starts out very uncertain about which values it should enforce and how it should engage with the world. You give it feedback and gradually make it more and more certain about some aspects of its behavior, so it can use its own judgement instead of constantly querying you. Eventually, you lock in some understanding of how it should extrapolate your values, and then the “corrigibility” phase is past and it just goes off to rearrange reality to your preferences.
And my concern, here, is that in the rearranged reality, there may not be any places for you to change your mind. Like, say you really hate people from Category A, and tell the AI to make them suffer eternally. Do you then visit their hell to gloat? Probably not: you’re just happy knowing they’re suffering in the abstract. Or maybe you do visit, and see the warped visages of these monsters with no humanity left in them, and think that yeah, that seems just and good.
I think those are perfectly good concerns. But they don’t seem so likely that they make me want to exterminate humanity to avoid them.
I think you’re describing a failure of corrigibility. Which could certainly happen, for the reason you give. But it does seem quite possible (and perhaps likely) that an agentic system will be designed primarily for corrigibility, or, alternatively, for alignment by obedience.
The second seems like a failure of morality. Which could certainly happen. But I see very few people who both enjoy inflicting suffering, and who would continue to enjoy that even given unlimited time and resources to become happy themselves.