I’d be really happy if someone were to figure out how to clearly characterize which Goodhart failure mode is occurring in a toy world with simple optimizers. (Bonus: also look at which types of agents do or do not display each failure mode.)
For example, imagine you have a blockworld, where the agent is supposed to push blocks to a goal, and is scored based on the blocks' distance from the goal. It would be good to have a clear way to delineate which failures can and do occur, and to assign each one a failure category.
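To make the setup concrete, here is a minimal sketch of such a world. It assumes a one-dimensional row of cells with a single block, and every name in it (`BlockWorld`, `push`, `score`) is hypothetical, chosen only for illustration. The score is the proxy described above: negative distance from the block to the goal.

```python
from dataclasses import dataclass


@dataclass
class BlockWorld:
    """Hypothetical 1-D blockworld: one block, one goal, cells 0..width-1."""
    width: int
    block: int  # x position of the block
    goal: int   # x position of the goal

    def push(self, dx: int) -> None:
        """Push the block by dx cells, clamped so it stays inside the walls."""
        self.block = max(0, min(self.width - 1, self.block + dx))

    def score(self) -> int:
        """Proxy score: negative distance from block to goal (0 is best)."""
        return -abs(self.goal - self.block)


world = BlockWorld(width=10, block=2, goal=9)
before = world.score()  # -7
world.push(+1)
after = world.score()   # -6
```

A real version would presumably be two-dimensional with several blocks, but the one-dimensional case is enough to separate the proxy (the score) from the intent (blocks end up at the goal).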
A change-in-regime failure might happen if the agent finds a strategy that works in the training world, where, say, the goal is against the right wall and you only ever need to push the blocks right, but in the test set the goal is elsewhere.
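As a rough illustration of the regime change, the sketch below hard-codes an "always push right" policy, which is what an agent might plausibly learn when the goal always sits against the right wall. The function name `run_push_right` and the specific positions are assumptions made up for this example.

```python
def run_push_right(width: int, block: int, goal: int, steps: int) -> int:
    """Apply a fixed 'always push right' policy; return the final proxy score."""
    for _ in range(steps):
        block = min(width - 1, block + 1)  # the right wall stops the block
    return -abs(goal - block)


# Training regime: goal against the right wall, so the policy looks perfect.
train_score = run_push_right(width=10, block=2, goal=9, steps=10)  # 0

# Test regime: goal against the left wall, so the same policy moves away from it.
test_score = run_push_right(width=10, block=5, goal=0, steps=10)   # -9
```

The policy is unchanged between the two runs; only the relationship between "push right" and "get closer to the goal" has changed.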
An extremal Goodhart failure might be that the training world is 10x10 and the test world is 20x20, and the agent stops pushing after moving the block 10 squares.
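The extremal case can be sketched the same way, assuming the agent has effectively learned "push 10 times, then stop" because that always sufficed in the small world. `run_fixed_pushes` and its default of 10 are hypothetical stand-ins for whatever the trained agent actually does.

```python
def run_fixed_pushes(width: int, block: int, goal: int, pushes: int = 10) -> int:
    """Push right a fixed number of times, then stop; return the proxy score."""
    for _ in range(pushes):
        block = min(width - 1, block + 1)  # the right wall stops the block
    return -abs(goal - block)


# 10-wide training world: 10 pushes always reach the right wall.
train_score = run_fixed_pushes(width=10, block=0, goal=9)   # 0

# 20-wide test world: the same 10 pushes leave the block well short of the goal.
test_score = run_fixed_pushes(width=20, block=0, goal=19)   # -9
```

Here the direction is still right; the failure only shows up because the test world lies outside the range of sizes seen in training.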
A causal Goodhart failure might occur if the goal is movable and the agent accidentally pushes the goal away from the spot it is moving the blocks toward.
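One possible reading of the movable-goal case, sketched under the assumption that a block entering the goal's cell shoves the goal one cell further along (as in a Sokoban-style chain push). The function `push_right` and the `goal_movable` flag are invented for this illustration and are not part of any existing environment.

```python
def push_right(width: int, block: int, goal: int, steps: int,
               goal_movable: bool) -> tuple[int, int, int]:
    """Push the block right up to `steps` times.

    Returns (block position, goal position, proxy score).
    """
    for _ in range(steps):
        if block == goal or block + 1 > width - 1:
            break  # already on the goal, or blocked by the right wall
        target = block + 1
        if goal_movable and target == goal and goal < width - 1:
            goal += 1  # the agent's own push displaces the goal
        block = target
    return block, goal, -abs(goal - block)


fixed = push_right(width=10, block=2, goal=5, steps=5, goal_movable=False)
# (5, 5, 0): the block reaches the goal

movable = push_right(width=10, block=2, goal=5, steps=5, goal_movable=True)
# (7, 8, -1): the goal has been shoved from cell 5 to cell 8
```

In the movable run the agent's actions change the thing being measured against, so the same pushing behaviour no longer brings the block to where the goal originally was.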