An important thing to notice about Goodhart’s law is that we roll three different phenomena together under one name. This isn’t entirely bad, because the three phenomena are very similar, but sometimes it helps to think about them separately.
Goodhart Level 1: Following the sugar/fruit example, imagine there are a bunch of different fruits with different levels of nutrients and sugar, and the nutrient and sugar levels are highly correlated. It is still the case that if you optimize for the most sugar, you will reliably get fewer nutrients than if you optimize for the most nutrients directly. (This is just the cost of using a proxy. I barely even want to call this Goodhart’s law.)
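Here is a minimal sketch of Level 1 in Python, with all numbers invented for illustration: sugar and nutrients share a common source of variation, yet picking the sweetest fruit still loses some nutrients on average compared to picking the most nutritious fruit directly.

```python
import random

random.seed(0)

def sample_fruits(n=20):
    """Sample fruits whose sugar and nutrient levels are highly correlated."""
    fruits = []
    for _ in range(n):
        quality = random.gauss(0, 1)             # shared latent "goodness"
        sugar = quality + random.gauss(0, 0.3)    # noisy proxy
        nutrients = quality + random.gauss(0, 0.3)  # what we actually want
        fruits.append((sugar, nutrients))
    return fruits

gaps = []
for _ in range(1000):
    fruits = sample_fruits()
    sweetest = max(fruits, key=lambda f: f[0])
    most_nutritious = max(fruits, key=lambda f: f[1])
    gaps.append(most_nutritious[1] - sweetest[1])

# Positive on average: the ordinary, boring cost of optimizing a proxy.
print("average nutrients lost by optimizing sugar:", sum(gaps) / len(gaps))
```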
Goodhart Level 2: It is possible that sugar content and nutrient content are highly correlated across foods, but there is one type of fruit that is pure sugar with no nutrients. If this fruit occurred in nature, but only very rarely, it would not mess with the statistical correlation between sugar and nutrients very much. Thus when an agent optimizes very hard for sugar, they end up with no nutrients, but if they had optimized only slightly, they would have found a normal fruit with lots of sugar and nutrients.
Note that this is more nuanced than just saying that we didn’t have pure sugar in the ancestral environment and we do now, so the actions that were good in the ancestral environment are a bad proxy for the actions that are good now. (Maybe just using a bad proxy should be called Goodhart Level 0.) The point is that the reason the environment is different now is that we optimized for sugar. We pushed ourselves into the section of possible worlds with lots of sugar by optimizing for sugar, and the correlation mostly only existed in the worlds with a moderate amount of sugar.
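A toy version of Level 2, again with made-up quantities: the world is mostly fruits where sugar tracks nutrients, plus one rare pure-sugar item. Mild optimization almost never encounters the outlier; maximizing sugar over the whole world always lands on it.

```python
import random

random.seed(0)

# Mostly normal fruits where sugar tracks nutrients...
world = []
for _ in range(1000):
    quality = random.gauss(5, 1)
    world.append({"sugar": quality, "nutrients": quality})
# ...plus one rare pure-sugar fruit.
world.append({"sugar": 20.0, "nutrients": 0.0})

def optimize_for_sugar(world, effort):
    """Pick the sweetest item among `effort` randomly inspected options."""
    options = random.sample(world, effort)
    return max(options, key=lambda x: x["sugar"])

mild = optimize_for_sugar(world, effort=10)          # almost never sees the outlier
hard = optimize_for_sugar(world, effort=len(world))  # always finds the outlier

print("mild optimization -> nutrients:", round(mild["nutrients"], 2))
print("hard optimization -> nutrients:", hard["nutrients"])
```

The rare outlier barely changes the sugar/nutrient correlation over the whole world, but hard optimization pushes you into exactly the part of the world where the correlation breaks.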
Goodhart Level 3: Say we live in a world with only a fixed amount of nutrients, and someone else wants a larger share of those nutrients. If you are using the proxy of sugar, and other people know this, an adversary might invent candy and then trade you their candy for some of your fruit. Another agent had a goal that was in conflict with your true goal, but not in conflict with your proxy, so they exploited your proxy and created options that satisfied your proxy but not your true goal.
I say more about this (in math) here: https://agentfoundations.org/item?id=1621
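A sketch of the Level 3 candy trade, with invented values: an adversary who knows you evaluate options by sugar can manufacture an option that scores high on your proxy and zero on your true goal, and you will accept the trade.

```python
def proxy_value(item):   # what you actually check (sugar)
    return item["sugar"]

def true_value(item):    # what you actually care about (nutrients)
    return item["nutrients"]

your_fruit = {"name": "fruit", "sugar": 5.0, "nutrients": 5.0}
candy = {"name": "candy", "sugar": 10.0, "nutrients": 0.0}  # made by the adversary

# You accept any trade that looks like an improvement under your proxy.
holding = candy if proxy_value(candy) > proxy_value(your_fruit) else your_fruit

print("proxy value:", proxy_value(your_fruit), "->", proxy_value(holding))
print("true value: ", true_value(your_fruit), "->", true_value(holding))
```

The proxy value goes up while the true value drops to zero; the adversary’s goal conflicted with your true goal but not with your proxy, so exploiting the proxy was cheap.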
Many instances of people Goodharting themselves fall into Level 2 (if I don’t step on the scale, I optimize myself out of the worlds where the scale’s number is correlated with my weight). However, I claim that some instances might be at Level 3, in particular rationalization. Maybe part of me wants to save the world and uses the ability to produce justifications for why an action saves the world as a proxy for what actually saves the world. Then a different part of me wants to goof off and produces a justification for why goofing off is actually the most world-saving action.
Awesome. Thanks for adding. I particularly like the inclusion of adversarial behavior into the mix—I hadn’t thought of the goal structure of the candymakers as undercutting/exploiting/taking advantage of the goal structure of the humans.