Hmm, I think the fact that our ancestors would have benefited from Twinkies, had they had access to them, is exactly why it is a correct example. The point is that we “learned” that sugar is good from a training set in which sugar was scarce. Then, once we became better at optimizing for sugar, sugar became abundant and the proxy stopped working.
It seems to me that you are arguing that the sugar example is not Adversarial Goodhart, which I agree with. The thing where open-ended preferences break because the utility curve turns negative once you get too much is one of the things I am trying to point at with Extremal Goodhart.
Okay, I think I disagree that extrapolating beyond the range of your data is Goodharting. I use the term for the narrower case where either the signal or the value stays in the trained range, but the two become very divergent from each other. E.g. artificial sweeteners break the link between sweetness and calories.
I don’t think this is quite isomorphic to the first paragraph, but highly related: I think of sweetness as a proxy for calories. Are you defining sweetness as a proxy for “good for me”?
I am thinking of sugar as a proxy for “good for me.”
I do not think that all instances of the training data not matching the environment you are optimizing in are Goodhart. However, if the reason the environment does not match the training is that the proxy is large, and the reason the proxy is large is that you are optimizing for it, then the optimization itself causes the failure of the proxy, which is why I am calling it Goodhart.
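To make that mechanism concrete, here is a minimal sketch (my own illustration, not from the thread): a proxy fit on data where the feature is low looks fine in-sample, and optimizing that proxy pushes the feature to levels where the true value turns negative. The `true_value` curve, the ranges, and the linear fit are all assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_value(sugar):
    # Hypothetical "good for me" curve: rises at low sugar, turns negative when sugar is high.
    return sugar - 0.2 * sugar**2

# Training regime: the ancestral environment only ever sees low sugar levels.
train_sugar = rng.uniform(0.0, 2.0, size=200)
train_value = true_value(train_sugar) + rng.normal(0.0, 0.05, size=200)

# The learned proxy is a linear fit: "more sugar is better" looks true in this range.
slope, intercept = np.polyfit(train_sugar, train_value, 1)
proxy = lambda sugar: slope * sugar + intercept

# Optimizing the proxy pushes sugar far outside the training range.
candidates = np.linspace(0.0, 10.0, 101)
best = candidates[np.argmax(proxy(candidates))]

print(f"proxy-optimal sugar level: {best:.1f}")              # the largest level allowed
print(f"proxy's predicted value:   {proxy(best):.2f}")        # high
print(f"actual value there:        {true_value(best):.2f}")   # negative
```

The proxy only fails because the optimization itself drove sugar outside the regime the proxy was learned in, which is the causal structure the comment above is pointing at.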