Keep in mind that the overseer (two steps forward) is always far more powerful than the agent we’re distilling (one step back), is trained to not Goodhart, is training the new agent to not Goodhart (this is largely my interpretation of what corrigibility gets you), and is explicitly searching for ways in which the new agent may want to Goodhart.
Well, but Goodhart lurks in the soul of all of us; the question here is something like “what needs to be true about the overseer such that it does not Goodhart (and can recognize it in others)?”
Corrigibility. Without corrigibility I would be just as scared of Goodhart.