Corrigibility is like… a mountain with an empty swimming pool at the top? If you can land in the pool, you’ll tend to stay there, and it’s easy to roll from the shallow end of the pool to the deep end. And the pool seems like a much easier target to hit than the teacup. But if you miss the pool, you’ll slide all the way down the mountain.
Also, the swimming pool is lined with explosives that are wired to blow up whenever you travel deeper into the Grand Canyon.
(OK, maybe some metaphors are not meant to be mixed...)
Not an answer, but I think of “adversarial coherence” as a property that training processes could select for: the agent keeps optimizing for the same utility function even under perturbations by weaker optimizing processes, much as humans fix errors while building a house, or AlphaZero wins a game of Go even when its opponent tries to disrupt its strategy. Adversarial coherence and corrigibility are incompatible.
The best resource I have found on why corrigibility is so hard is the Arbital post. Are there other good summaries that I should read?