Thomas Kwa comments on A central AI alignment problem: capabilities generalization, and the sharp left turn

Thomas Kwa 22 Jul 2022 23:11 UTC
Not an answer, but I think of “adversarial coherence” as a property that training processes could select for: the agent keeps optimizing for the same utility function even when weaker optimizing processes perturb it, the way humans fix errors while building a house, or AlphaZero still wins a game of Go when its opponent tries to disrupt its strategy. Adversarial coherence and corrigibility are incompatible.
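
To make the definition a bit more concrete, here is a toy sketch of one possible reading of it, not a formalization from the comment itself. Everything in it is an illustrative assumption: the one-dimensional state, the hypothetical `utility` that peaks at 0, and the names `agent_step`, `run`, and `agent_strength`. A "strong" agent undoes each small push from a weaker process, while an agent with no optimization power lets the pushes accumulate.

```python
def utility(x: float) -> float:
    """Hypothetical utility function: highest (zero) at x = 0."""
    return -abs(x)


def agent_step(x: float, strength: float) -> float:
    """Move the state at most `strength` toward the utility peak at 0."""
    if abs(x) <= strength:
        return 0.0
    return x - strength if x > 0 else x + strength


def run(perturbations: list[float], agent_strength: float) -> float:
    """Alternate a weaker process's push with the agent's correction."""
    x = 0.0
    for push in perturbations:
        x += push                          # weaker optimizer perturbs the state
        x = agent_step(x, agent_strength)  # agent re-optimizes the same utility
    return x


pushes = [0.25] * 4  # each push is smaller than the strong agent's step size

coherent_final = run(pushes, agent_strength=1.0)  # corrects every push
passive_final = run(pushes, agent_strength=0.0)   # never corrects

print(utility(coherent_final), utility(passive_final))
```

In this toy setup the agent counts as "coherent" only because its per-step optimization power exceeds the size of each perturbation; the sketch says nothing about how a training process would select for that property.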