If corrigibility has one central problem, I would call it: How do you say “If A, then prefer B.” instead of “Prefer (if A, then B).”? Compare pytorch’s detach, which permits computation to pass forward, but prevents gradients from propagating backward, by acting as an identity function with derivative 0.
If corrigibility has one central problem, I would call it: How do you say “If A, then prefer B.” instead of “Prefer (if A, then B).”? Compare pytorch’s detach, which permits computation to pass forward, but prevents gradients from propagating backward, by acting as an identity function with derivative 0.