Max Harms comments on Corrigibility could make things worse

Max Harms 13 Jun 2024 16:08 UTC
Thanks! I now feel unconfused. To briefly echo back the key idea as I heard it (and which I also agree with): a technique that can create a corrigible PAAI might rely on assumptions that break if the technique is used to make a different kind of AI (i.e. one aimed at CEV). If we call this technique “the Corrigibility method”, then we may end up using the Corrigibility method to make AIs that aren’t at all corrigible but merely seem corrigible, resulting in disaster.
This is a useful insight! Thanks for clarifying. :)