I haven’t gotten into MSFP or the like, but I like helping with the world’s plot anyway. I’ve won $25 in the Oracle contest, so I’ve told Stuart to put it towards further public contribution opportunities such as contests. He suggested that I run a small contest myself, for that $25 plus another $25 from him.
So, selfish as I am, in one month the $50 goes to the highest-karma answer, and another $50 from me goes to the answer that actually gets me to contribute. (In case of ambiguity on the latter, I’ll try to award it to whichever answer caused my best contribution. AF karma would be a good measure, except maybe the opportunities aren’t all on AF...)
Examples of where I’m likely to contribute can be found via: https://www.alignmentforum.org/users/gurkenglas
Note how $25 of prize money turns out to be worth more than $25 in your pocket :)
Edit: The contest is over; the contestant wins $50 by walkover.
I’d be really happy if someone were to figure out how to clearly characterize which Goodhart failure mode is occurring in a toy world with simple optimizers. (Bonus: also look at which types of agents do or do not display the different failure modes.)
For example, imagine you have a blockworld where the agent is supposed to push blocks to a goal and is scored based on distance from the goal. It would be good to have a clear way to delineate which failures can or do occur, and to name the failure category.
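As a rough sketch of what I have in mind (the class name, grid layout, and exact scoring here are just illustrative guesses, not a specific proposal), the toy world could be as simple as:

```python
import numpy as np

class BlockWorld:
    """Tiny grid world: the agent pushes one block toward a goal square.

    True objective: the block ends up on the goal.
    Proxy the optimizer sees: negative Manhattan distance block -> goal.
    """

    def __init__(self, size=10, block=(2, 2), goal=(9, 2)):
        self.size = size
        self.block = np.array(block)
        self.goal = np.array(goal)

    def push(self, direction):
        """Push the block one square; direction is 'up', 'down', 'left' or 'right'."""
        step = {"up": (0, 1), "down": (0, -1),
                "left": (-1, 0), "right": (1, 0)}[direction]
        new_pos = self.block + np.array(step)
        if (new_pos >= 0).all() and (new_pos < self.size).all():
            self.block = new_pos

    def proxy_score(self):
        # Proxy metric: negative Manhattan distance from block to goal.
        return -int(np.abs(self.block - self.goal).sum())

    def true_success(self):
        # True goal: the block actually sits on the goal square.
        return bool((self.block == self.goal).all())
```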
A regime-change failure might happen if the agent finds a strategy that works in the training world (where, say, the goal is against the right wall, so you only ever need to push the blocks right), but in the test set the goal is elsewhere.
An extremal Goodhart failure might be that the training world is 10x10 but the test world is 20x20, and the agent stops pushing after moving a block 10 squares.
A causal Goodhart failure might occur if the goal marker itself is movable, and the agent accidentally pushes the goal away from the spot it is pushing the blocks towards.
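Continuing that sketch, the first two cases could be written down as explicit train/test pairs (the causal case would need a richer world in which the goal marker itself is pushable); the particular policy and numbers are again just illustrative:

```python
def push_right_k(world, k):
    """Policy 'learned' in training: push right k times, then stop."""
    for _ in range(k):
        world.push("right")
    return world.true_success()

# Training world: the goal is against the right wall, so 7 pushes suffice.
train = BlockWorld(size=10, block=(2, 2), goal=(9, 2))
print(push_right_k(train, 7))  # True: the learned strategy works here.

# Regime change: same size, but the goal is now at the top of the grid.
test_regime = BlockWorld(size=10, block=(2, 2), goal=(2, 9))
print(push_right_k(test_regime, 7))  # False: pushing right never helps.

# Extremal: the test world is 20x20, so stopping after the same number
# of pushes leaves the block far short of the goal.
test_extremal = BlockWorld(size=20, block=(2, 2), goal=(19, 2))
print(push_right_k(test_extremal, 7))  # False: stops halfway there.

# Causal Goodhart (a movable goal the agent accidentally pushes away) isn't
# expressible in this minimal BlockWorld; the goal would itself need to be
# a pushable object.
```

The interesting deliverable would then be a procedure that, given the training world, the test world, and the agent’s behaviour, outputs which of these failure categories applies.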