I think the example of sugar is off. Sugar was not originally a proxy for vitamins, because sugar was rarer than vitamins. A taste for sugar was optimizing for calories, which at the time was heavily correlated with survival. If our ancestors had access to Twinkies, they would have benefited from them. The problem isn’t that we became better at hacking the sugar signal; it’s that we evolved an open-ended preference for sugar even though the utility curve eventually turns negative.
A potential replacement: we evolved to find bright, shiny colors in fruit attractive because that signified vitamins, and modern breeding techniques have completely hacked this.
I worry I’m being pedantic by bringing this up, but I think the difference between “hackable proxies” and “accurate proxies for which we mismodeled the underlying reality” is important.
Hmm, I think the fact that if our ancestors had access to Twinkies they would have benefited from them is why it is a correct example. The point is that we “learned” sugar is good from a training set in which sugar is low. Then, when we became better at optimizing for sugar, sugar became high and the proxy stopped working.
It seems to me that you are arguing that the sugar example is not Adversarial Goodhart, which I agree with. The thing where open-ended preferences break because the utility curve becomes negative when you get too much is one of the things I am trying to point at with Extremal Goodhart.
Okay, I think I disagree that extrapolating beyond the range of your data is Goodharting. I use the term for the narrower case where either the signal or the value stays in the trained range, but the two become very divergent from each other. E.g., artificial sweeteners break the link between sweetness and calories.
I don’t think this is quite isomorphic to the first paragraph, but highly related: I think of sweetness as a proxy for calories. Are you defining sweetness as a proxy for “good for me”?
I am thinking of sugar as a proxy for “good for me.”
I do not think that all instances of training data not matching the environment you are optimizing in are Goodhart. However, if the reason that the environment does not match the training is that the proxy is large, and the reason the proxy is large is that you are optimizing for it, then the optimization causes the failure of the proxy, which is why I am calling it Goodhart.
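To make that dynamic concrete, here is a minimal sketch of my own (not something from the discussion above): a proxy for “good for me” is fit on data where sugar is low, and optimizing that proxy then pushes sugar far outside the training range, where the true utility curve turns negative. The utility function, data ranges, and numbers are all made-up assumptions for illustration.

```python
# Illustrative sketch only: a proxy learned where sugar is scarce keeps rising,
# while the (assumed) true utility turns negative once optimization pushes
# sugar far past the training range.
import numpy as np

rng = np.random.default_rng(0)

def true_utility(sugar):
    # Assumed shape: calories help up to a point, then excess harms you.
    return sugar * (2.0 - sugar)  # peaks at sugar = 1, negative past sugar = 2

# "Ancestral" training data: sugar is low, so we only ever see the rising part.
train_sugar = rng.uniform(0.0, 0.5, size=200)
train_value = true_utility(train_sugar) + rng.normal(0, 0.02, size=200)

# The learned proxy is a simple linear fit: "more sugar is better."
slope, intercept = np.polyfit(train_sugar, train_value, 1)
proxy = lambda s: slope * s + intercept

# Optimizing the proxy just means pushing sugar as high as possible.
for sugar in [0.25, 1.0, 2.0, 4.0]:
    print(f"sugar={sugar:>4}: proxy={proxy(sugar):6.2f}, true={true_utility(sugar):6.2f}")
```

The fit says “more sugar is always better” because the training set never contains the downward part of the curve, so it is the optimizer’s own success at raising sugar that breaks the proxy.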