That sounds accurate. In the particular setting of RLHF (which this paper attempts to simulate), I think there are actually three levels of proxy (see the sketch after this list):
The train-test gap (our RM, and the resulting policies, are less valid out of distribution)
The data-RM gap (our RM doesn’t capture the data perfectly)
The intent-data gap (the data doesn’t necessarily reflect the human’s actual intent, i.e., it captures what merely looks good to the human given the sensors they have, as opposed to what they actually want)
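To make these concrete, here is a minimal sketch of where each gap sits in an RLHF-style pipeline. This is my own framing, not from the paper, and all names are hypothetical placeholders:

```python
# Illustrative sketch (my framing, hypothetical names) of where each
# proxy gap sits in an RLHF-style pipeline.

def true_intent(output: str) -> float:
    """What the human actually wants. Not directly observable."""
    ...

def human_label(output: str) -> float:
    """Preference data collected from the human.
    Intent-data gap: labels track what *looks* good given the human's
    sensors, which can diverge from true_intent."""
    ...

def reward_model(output: str) -> float:
    """RM trained to fit the labels.
    Data-RM gap: the RM does not capture the label data perfectly."""
    ...

def policy_sample() -> str:
    """Output of a policy optimized against reward_model.
    Train-test gap: optimization pushes the policy off the RM's
    training distribution, where the RM (and hence the policy)
    is less valid."""
    ...
```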
Regularization likely helps a lot, but I think the main reason regularization is insufficient as a full solution to Goodhart is that it breaks if the simplest generalization of the training set is bad (or if the data is broken in some way). In particular, there are specific kinds of generalizations that are consistently bad and potentially simple; for instance, I would think of things like ELK human simulators and deceptive alignment as fitting into this framework.
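For concreteness, the regularization in question is typically a KL penalty toward a reference policy; in standard notation (mine, not the comment's), the regularized objective looks roughly like:

```latex
J(\pi) \;=\; \mathbb{E}_{x \sim \pi}\!\left[\, r_{\mathrm{RM}}(x) \,\right]
        \;-\; \beta \, D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right)
```

If I read the claim correctly, the issue is that the penalty only keeps \(\pi\) near \(\pi_{\mathrm{ref}}\); it does nothing to correct \(r_{\mathrm{RM}}\) itself, so a bad-but-simple generalization learned from the training data survives the regularization.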
(I also want to flag that I think we have very different ontologies, so I expect you will probably disagree with or find the previous claim strange, but I think the crux actually lies somewhere else.)