Now that I understand "corrigible" isn't synonymous with "satisfying my short-term preferences-on-reflection", "corrigibility is relatively easy to learn" doesn't seem enough to imply these things.
I agree that you still need the AI to be trying to do the right thing (even though we don’t e.g. have any clear definition of “the right thing”), and that seems like the main way that you are going to fail.