I think this is a fascinating question, albeit one that's irrelevant to technical alignment and our near-term survival. I think not only that corrigibility or DWIM is an attractive primary goal for AGI, but that it's so much easier as to almost certainly be what people actually try for their first AGI alignment attempts. Understanding what you mean by what you say, and checking when it's not sure, is much simpler than understanding and implementing an ideal ethics. Alignment is hard enough without solving ethics, so we'll put that part off because we can.
I think you're hitting an important tension here: being nice and liberal seems like a value we'd endorse. The big problem is: if you're tolerant of those with other values, will they be tolerant of you? How would you know whether they'll lie until they have the power to do what they want with your backyard (deceptive alignment), or genuinely change their minds once they have that power?
The overall conclusion is that, while we'd like to be liberal and nice and tolerant, it will get us killed in a lot of situations where others aren't tolerant in return. Which situations those are could use some more careful analysis.
This logic is laid out in detail by Yudkowsky across many posts. I think he's considered the pull toward tolerance and niceness in detail. Steve Byrnes' comment here hits some high points. It's a topic worth more consideration.