The Harms version of corrigibility is pretty similar in that the AGI should follow instructions first and foremost, even though it has a more elaborate model of the user’s preferences to help it interpret instructions correctly, and it’s supposed to act on its own initiative in some cases. But the two approaches may converge almost completely once a user has given their DWIMAC AGI a wise set of standing instructions.
Note that the link to the Harms version of corrigibility doesn’t work.
Thank you! Fixed.