Not an answer, but I think of “adversarial coherence” (the agent keeps optimizing for the same utility function even under perturbations by weaker optimizing processes, like how humans will fix errors in building a house or AlphaZero can win a game of Go even when an opponent tries to disrupt its strategy) as a property that training processes could select for. Adversarial coherence and corrigibility are incompatible.
Not an answer, but I think of “adversarial coherence” (the agent keeps optimizing for the same utility function even under perturbations by weaker optimizing processes, like how humans will fix errors in building a house or AlphaZero can win a game of Go even when an opponent tries to disrupt its strategy) as a property that training processes could select for. Adversarial coherence and corrigibility are incompatible.