Corrigibility is like… a mountain with an empty swimming pool at the top? If you can land in the pool, you’ll tend to stay there, and it’s easy to roll from the shallow end of the pool to the deep end. And the pool seems like a much easier target to hit than the teacup. But if you miss the pool, you’ll slide all the way down the mountain.
Also, the swimming pool is lined with explosives that are wired to blow up whenever you travel deeper into the Grand Canyon.
(OK, maybe some metaphors are not meant to be mixed...)
Not an answer, but I think of “adversarial coherence” as a property that training processes could select for: the agent keeps optimizing for the same utility function even under perturbations by weaker optimizing processes, much as humans fix errors while building a house, or AlphaZero wins a game of Go even when its opponent tries to disrupt its strategy. Adversarial coherence and corrigibility are incompatible.
The best resource I have found on why corrigibility is so hard is the Arbital post. Are there other good summaries that I should read?