To the extent that even idealized humans are not perfectly safe (e.g., perhaps a white-box metaphilosophical approach is even safer), and that corrigibility seems to conflict with greater transparency and hence cooperation between AIs, there still seems to be some tension between corrigibility and safety even when X = idealized humans.
ETA: Do you think IDA can be used to produce an AI that is corrigible by some kind of idealized human? That might be another approach that’s worth pursuing if it looks feasible.
Yes.