Very interesting work!
You might like my new post, which explains why I think only Oracles can resolve Causal Goodhart-like problems: https://www.lesserwrong.com/posts/iK2F9QDZvwWinsBYB/non-adversarial-goodhart-and-ai-risks
I’m unsure whether the approach in your paper is sufficient to resolve the causal Goodhart concerns; I need to think much more about the way the reward function is defined, but it seems it might not be. This question is really important for the follow-on work on adversarial Goodhart, and I’m still trying to figure out how to characterize which metrics / reward functions are and are not susceptible to corruption in this way. Perhaps a cryptographic approach could solve parts of the problem.