Idk, if you’re carving up the space into mutually exclusive “Causal Goodhart” and “Extremal Goodhart” problems, then I expect conditioning to have stronger Extremal Goodhart problems, just because RL can change causal mechanisms to achieve high performance, whereas conditioning has to get high performance purely by sampling more and more extreme outputs.
(But mostly I think you don’t want to carve up the space into mutually exclusive “Causal Goodhart” and “Extremal Goodhart”.)
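To make that contrast concrete, here is a toy sketch (everything in it, the Gaussian “base model”, the linear proxy, and the update rule, is invented for illustration, not taken from the discussion): conditioning filters samples from a fixed base distribution, so its proxy score is capped by the tail of that distribution, while an RL-style optimizer with no tether to the base model can push the proxy arbitrarily high.

```python
import numpy as np

rng = np.random.default_rng(0)

def proxy_reward(x):
    # Invented proxy: more extreme x scores higher.
    return x

# Base "model": a fixed distribution over outputs.
base = rng.normal(0.0, 1.0, size=100_000)

# Conditioning: filter samples of the unchanged base distribution.
# High proxy scores can only come from its extreme tail.
threshold = np.quantile(base, 0.999)
conditioned = base[base >= threshold]

# RL (crude stand-in): a REINFORCE-style update on a Gaussian policy,
# with no KL tether to the base model, so the policy mean keeps moving.
mean = 0.0
for _ in range(1000):
    samples = rng.normal(mean, 1.0, size=256)
    grad = np.mean(proxy_reward(samples) * (samples - mean))
    mean += 0.05 * grad

print(f"conditioned mean proxy: {proxy_reward(conditioned).mean():.2f}")  # ~3.4
print(f"RL policy mean proxy:   {mean:.2f}")  # ~50, and still climbing
```

The point is only directional: the conditioned samples stay inside the base model's support, while the unregularized RL objective keeps walking the policy out of it.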
Extremal Goodhart is different from other forms of Goodhart in that a maximal value of the proxy will always lead to less, or even zero, true reward.
This is difficult to show, since you need to show that anything you maximize implies non-maximal true reward. That’s different from causal Goodhart, where the causal relationship itself is mistaken.
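A minimal numerical sketch of that shape (both functions are made up): a proxy and a true reward that track each other over the ordinary range, with the true reward collapsing as the proxy is pushed toward its extremes.

```python
import numpy as np

def proxy(x):
    return x

def true_reward(x):
    # Invented: tracks the proxy for moderate x, collapses at extremes.
    return x - np.exp(x - 5.0)

for x in [0.0, 2.0, 4.0, 6.0, 10.0, 20.0]:
    print(f"x={x:5.1f}  proxy={proxy(x):5.1f}  true={true_reward(x):12.1f}")
```

Here any x that makes the proxy extreme has very low true reward, which is the “maximal proxy implies non-maximal true reward” condition described above.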
Extremal Goodhart is not differentially a problem for RL vs conditioning, right?
I think so, yes.