First, thanks to both of you for writing this really nice paper. I have two questions, which are more about my understanding (or lack thereof) than about issues with the work.
My first question is about an example I have in mind, and whether my classification of it makes sense. In this example, the regulator is a grant-maker, and the agents are researchers in a very theoretical field. Our regulator wants to optimize for as many concrete applications as possible, regardless of what the researchers are interested in (any similarity with the real world is obviously unintended). To reach this goal, the regulator prioritizes funding grant proposals that promise applications. Yet the researchers could just write about applications while only pursuing the theory (still no intended similarity with the real world...).
It seems clear to me that this is an instance of adversarial Goodhart, and more specifically of adversarial misalignment Goodhart. Do you also think this is the case? If not, why?
My second question is more of a request: what “concrete” example could you give me of non-causal cobra effect Goodhart? I have some trouble visualizing it.