It also transfers in an obvious way to AGI programming, where it would correspond to something like an automated “interpretability” module that tries to make sense of the AGI’s latent variables by correlating them with other labeled properties of the AGI’s inputs, and then rewards the AGI for “thinking about the right things” (according to the interpretability module’s output), which in turn helps turn those rewarded thoughts into the AGI’s goals.
(Is this a good design idea that AGI programmers should adopt? I don’t know, but I find it interesting, and at least worthy of further thought. I don’t recall coming across this idea before in the context of inner alignment.)
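To make the shape of the idea concrete, here is a minimal toy sketch, not a proposal for how an actual AGI training loop would look. It assumes a small PyTorch agent, a supervised batch of inputs with labeled properties, and a simple linear probe standing in for the “interpretability” module; names like `TinyAgent`, `LatentProbe`, and `shaped_reward` are hypothetical illustrations, not anything from the original discussion:

```python
import torch
import torch.nn as nn

class TinyAgent(nn.Module):
    """Stand-in for the AGI: an encoder producing latents, plus a policy head."""
    def __init__(self, obs_dim=16, latent_dim=32, n_actions=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU())
        self.policy_head = nn.Linear(latent_dim, n_actions)

    def forward(self, obs):
        latent = self.encoder(obs)
        return latent, self.policy_head(latent)

class LatentProbe(nn.Module):
    """'Interpretability' module: a linear probe from latents to a labeled input property."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.linear = nn.Linear(latent_dim, 1)

    def forward(self, latent):
        # Estimated probability that the latent reflects the labeled property.
        return torch.sigmoid(self.linear(latent))

agent, probe = TinyAgent(), LatentProbe()
probe_opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def train_probe(obs_batch, property_labels):
    """Fit the probe by correlating the agent's latents with labeled input properties."""
    with torch.no_grad():
        latent, _ = agent(obs_batch)  # latents are not updated by the probe's loss
    pred = probe(latent).squeeze(-1)
    loss = nn.functional.binary_cross_entropy(pred, property_labels)
    probe_opt.zero_grad()
    loss.backward()
    probe_opt.step()
    return loss.item()

def shaped_reward(task_reward, obs, bonus_scale=0.1):
    """Add a bonus when the probe says the agent is 'thinking about the right things'."""
    with torch.no_grad():
        latent, _ = agent(obs)
        bonus = probe(latent).mean().item()
    return task_reward + bonus_scale * bonus
```

In this toy version, the probe is trained only on the correlation between latents and labels, while the extra reward term is what would (in the hoped-for case) push the rewarded thoughts toward becoming the agent’s goals.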
Fwiw, I think this is basically a form of relaxed adversarial training, which is my favored solution for inner alignment.