Looking for “functions that don’t exhibit Goodhart effects under extreme optimization” might be a promising area to look into. What does it mean for a function to behave as expected under extreme optimization? Can you give a toy example?
I’m actually not really sure. We have some vague notion that, for example, my preference for eating pizza shouldn’t result in attempts at unbounded pizza-eating maximization. I would probably be unhappy, by my current values, if a maximizing agent saw that I liked pizza best of all foods and then proceeded to feed me only pizza forever, even if it modified me so that I maximally enjoyed the pizza each time and never got bored of it.
Thinking more in terms of regressional Goodharting, maybe it means something like: optimizing for the measure doesn’t make you deviate from the true target. Consider the classic rat-extermination example of Goodharting. We already know that collecting rat tails as evidence of extermination is a function that leads to weird effects. Does there exist a function measuring rat extermination that, when optimized for, produces the intended effect (extermination of rats) without doing anything “weird”, e.g. generating unintended side effects or maximizing rat reproduction so we can exterminate more of them, and instead just straightforwardly leads to the extinction of rats and nothing else?
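Here is a rough toy sketch of the regressional case (just a simulation, with made-up names like `average_shortfall` and a Gaussian noise model standing in for measurement error): the proxy is the true target plus independent noise, and the harder we select on the proxy, the further the selected option falls short of the best achievable true value.

```python
import numpy as np

rng = np.random.default_rng(0)

def average_shortfall(n_options, noise=1.0, trials=100):
    """How much true value we lose, on average, by picking the option that
    scores highest on a noisy proxy instead of the truly best option."""
    gaps = []
    for _ in range(trials):
        true = rng.normal(size=n_options)                       # e.g. rats actually exterminated
        proxy = true + rng.normal(scale=noise, size=n_options)  # e.g. tails collected
        gaps.append(true.max() - true[np.argmax(proxy)])        # shortfall from chasing the proxy
    return float(np.mean(gaps))

for n in (10, 1_000, 100_000):
    print(f"{n:>7} options: average true-value shortfall = {average_shortfall(n):.2f}")
```

The shortfall grows as the number of options (the optimization pressure) grows, which is the regressional Goodhart pattern: the proxy-best and the true-best come apart in the tails.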
Right, that’s the question. Sure, it is easy to state that “the metric must be a faithful representation of the target”, but it never is, is it? From the point of view of double inversion, optimizing the target is a hard inverse problem because, as in your pizza example, the true “values” (pizza as a preference against the background of an otherwise balanced diet) are not easily observable. What would a double inverse look like in this case? Maybe something like trying various amounts of pizza and getting feedback on enjoyment? That would match the long division pattern. I’m not sure.
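To make that concrete, here is a tiny sketch, with an entirely made-up `enjoyment` function standing in for the unobservable preference: a naive “more pizza is better” reading runs away under optimization, while the try-it-and-get-feedback loop settles on a bounded amount.

```python
import math

def enjoyment(slices):
    # Hypothetical hidden preference: enjoyment peaks around three slices, then falls off.
    return slices * math.exp(-slices / 3.0)

# Naive reading of "likes pizza best": more pizza is always better, so extreme
# optimization runs off to the largest amount on offer (and to infinity in general).
naive_choice = max(range(50), key=lambda s: s)

# Feedback loop: actually try each amount and keep whatever observed enjoyment
# favors, checking each guess against the hard-to-model preference the way
# long division checks each digit by multiplying back.
feedback_choice = max(range(50), key=enjoyment)

print(naive_choice, feedback_choice)  # 49 vs. 3
```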