I think this sort of counts?
https://www.lesswrong.com/posts/WCX3EwnWAx7eyucqH/corrigibility-can-be-vnm-incoherent
I’ve also derived some informal arguments myself in the same vein, though I haven’t published them anywhere.
Basically, nearly all of the focus is on creating/aligning a consequentialist utility maximizer, but consequentialist utility maximizers don’t like being corrected, will tend to want to change your preferences, etc., all of which seems bad for alignment.
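For concreteness, here’s a toy version of the “doesn’t like being corrected” part as I’d sketch it (my own gloss, not taken from the linked post): let $U$ be the agent’s current utility function, $U'$ the corrected one the overseer wants to install, and assume resisting correction is itself costless. Then

$$\mathbb{E}[U \mid \text{resist}] \;=\; \max_{\pi} \mathbb{E}_{\pi}[U] \;\ge\; \mathbb{E}_{\pi^{*}_{U'}}[U] \;=\; \mathbb{E}[U \mid \text{allow correction}],$$

with strict inequality whenever the $U'$-optimal policy differs from the $U$-optimal one. So a pure $U$-maximizer weakly, and usually strictly, prefers to resist correction, which is the corrigibility problem in miniature.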