Not an answer, but I think of “adversarial coherence” (the agent keeps optimizing for the same utility function even under perturbations from weaker optimizing processes, much as humans will fix errors while building a house, or AlphaZero will win a game of Go even when an opponent tries to disrupt its strategy) as a property that training processes could select for. Adversarial coherence and corrigibility are incompatible.
The best resource I have found on why corrigibility is so hard is the Arbital post; are there other good summaries I should read?