One can think of overfitting as a special case of Goodhart’s law where the proxy is the score on
some finite set of samples, whereas our actual objective also depends on how well the model generalizes beyond those samples.
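To make that framing concrete, here is a minimal numpy sketch (my own toy example, not anything from the paper): the proxy is the score on a small fixed sample, the actual objective is the score on the underlying distribution, and pushing harder on the proxy (more polynomial degrees of freedom) eventually trades the latter away for the former.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(np.pi * x)

# Small finite sample: the proxy is the (negative) error on exactly these points.
x_train = rng.uniform(-1, 1, 12)
y_train = true_fn(x_train) + rng.normal(0, 0.3, size=12)

# A large fresh sample stands in for the actual data distribution.
x_test = rng.uniform(-1, 1, 10_000)
y_test = true_fn(x_test)

for degree in [1, 3, 5, 9, 11]:
    coeffs = np.polyfit(x_train, y_train, degree)                     # optimize the proxy
    proxy = -np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)    # score on the sample
    actual = -np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)     # score on the distribution
    print(f"degree {degree:2d}   proxy {proxy:9.4f}   actual {actual:9.4f}")
```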
When we train on dataset X and deploy on actual data Y, that is just one level of proxy. But if you train a proxy utility function on some dataset X and then use that to train a model on dataset Y and finally deploy on actual data Z, we now have two levels of proxy: the mismatch between the proxy and the real utility function, and the standard train-test mismatch between Y and Z.
Still, doesn't this analogy suggest that regularization is also the general solution to Goodharting? There is a tradeoff of course, but as you simplify the proxy utility function you can reduce overfitting and increase robustness. This also suggests that perhaps the proxy utility function should be regularized/simplified beyond the point that merely maximizes fit to the proxy training data.
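Here is a rough toy of that last claim (again my own sketch; the linear utility, ridge penalty, and constants are all just assumptions for illustration, and it collapses the Y/Z train-test shift so only the proxy-vs-true-utility level remains): we fit a proxy utility function to noisily scored samples, then "deploy" by picking the action that maximizes it. Fit to the proxy training data is always best with the least regularization, but the true utility the optimizer actually achieves tends to peak at a strictly larger regularization strength.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, noise = 50, 30, 1.0

# Unknown true utility U(x) = w_true . x (just a stand-in).
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)

# "Dataset X": noisily scored samples used to fit the proxy utility.
X = rng.normal(size=(n, d))
y = X @ w_true + noise * rng.normal(size=n)

for lam in [1e-3, 0.1, 1.0, 10.0, 100.0]:
    # Ridge fit of the proxy utility function.
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    fit_mse = np.mean((X @ w_hat - y) ** 2)        # fit to the proxy training data
    # "Policy": pick the unit-norm action that maximizes the proxy utility.
    action = w_hat / np.linalg.norm(w_hat)
    achieved = float(w_true @ action)              # true utility actually obtained
    print(f"lambda {lam:7.3g}   fit MSE {fit_mse:6.3f}   true utility {achieved:6.3f}")
```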
That sounds accurate. In the particular setting of RLHF (that this paper attempts to simulate), I think there are actually three levels of proxy (a toy sketch of how these stack follows the list):
The train-test gap (our RM (and resulting policies) are less valid out of distribution)
The data-RM gap (our RM doesn't capture the data perfectly)
The intent-data gap (the data doesn't necessarily reflect the human's actual intent, i.e. it rewards things that merely look good to the human given the sensors they have, as opposed to what they actually want)
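Here is a very hand-wavy toy of how those three gaps stack (a hypothetical setup of my own, not the paper's): true intent is one feature of a response, the rater's labels also reward a "looks good" feature plus noise, a small linear RM is fit to those labels, and the deployment distribution lets the policy push the non-quality features much harder than training ever did.

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature 0 = real quality, feature 1 = "looks good to the rater", feature 2 = irrelevant.
def true_intent(x):                 # what the human actually wants
    return x[:, 0]

def human_label(x):                 # what the rater's sensors actually reward
    return x[:, 0] + 0.8 * x[:, 1] + 0.5 * rng.normal(size=len(x))

def sample_train(n):                # distribution the RM is trained on
    return rng.normal(0, 1, size=(n, 3))

def sample_deploy(n):               # at deployment the policy can push
    x = rng.normal(0, 1, size=(n, 3))
    x[:, 1:] *= 3.0                 # the non-quality features much harder
    return x

# Fit a small linear RM to a modest number of noisy labels.
X = sample_train(40)
y = human_label(X)
w = np.linalg.lstsq(X, y, rcond=None)[0]

def rm(x):
    return x @ w

deploy = sample_deploy(1000)
best = deploy[np.argmax(rm(deploy))][None, :]    # best-of-n against the RM

print("intent-data gap (label vs intent, train dist): %.2f"
      % np.mean((y - true_intent(X)) ** 2))
print("data-RM gap     (RM vs label,    train dist): %.2f"
      % np.mean((rm(X) - y) ** 2))
print("train-test gap  (RM vs label,    deploy dist): %.2f"
      % np.mean((rm(deploy) - human_label(deploy)) ** 2))
print("best-of-n pick:  RM %.2f   label %.2f   true intent %.2f"
      % (rm(best)[0], human_label(best)[0], true_intent(best)[0]))
```

The three printed gaps line up with the three items above; the numbers themselves are meaningless beyond showing that each wedge is nonzero, that the RM's error typically grows under the shift, and that the best-of-n pick scores far better under the RM and the labels than under true intent.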
Regularization likely helps a lot, but I think the main reason regularization is insufficient as a full solution to Goodhart is that it breaks if the simplest generalization of the training set is bad (or if the data is broken in some way). In particular, there are specific kinds of generalizations that are consistently bad and potentially simple; for instance, I would think of things like ELK human simulators and deceptive alignment as fitting into this framework.
(I also want to flag that I think the two of us have quite different ontologies, so I expect you will probably disagree with or find the previous claim strange, but I think the crux actually lies somewhere else.)