in some ways corrigibility is actually opposed to safety
We can talk about “corrigible by X” for arbitrary X. I don’t think these considerations imply a tension between corrigibility and safety; they just suggest “humans in the real world” may not be the optimal X. You might prefer to use an appropriate idealization of humans, humans in some safe environment, etc.
To the extent that even idealized humans are not perfectly safe (e.g., perhaps a white-box metaphilosophical approach is even safer), and that corrigibility seems to conflict with greater transparency and hence cooperation between AIs, there still seems to be some tension between corrigibility and safety even when X = idealized humans.
ETA: Do you think IDA can be used to produce an AI that is corrigible by some kind of idealized human? That might be another approach that’s worth pursuing if it looks feasible.
Yes.