...we’re talking about a class of problems that already comes up in all sorts of practical engineering, and which can be satisfactorily handled in many real cases without needing any philosophical advances.
The explicit assumption of the discussion here is that we can’t pass the full objective function to the subsystem—so it cannot possibly have the goal fully well-defined. This doesn’t depend on whether the subsystem is really smart or really dumb; it’s a fundamental problem if you can’t tell the subsystem enough to solve it.
But I don’t think that’s a fair characterization of most Goodhart-like problems, even in the limited practical case. Bad models and causal mistakes don’t get mitigated unless we get the correct model, and adversarial Goodhart is much worse than that. I agree the characterization fits “tails diverge” / regressional Goodhart, and we have solutions for that case (compute the Bayes estimate, as discussed previously), but only once the goal is well-defined. (We have mitigations for the other cases, but they have their own drawbacks.)
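To make the regressional case concrete, here is a minimal sketch (my own toy model, not anything from the original discussion) of why selecting on a noisy proxy overestimates the true value, and how the Bayes estimate corrects for it—assuming a simple Gaussian setup where the prior and the noise level are both known, i.e. the goal is well-defined:

```python
import numpy as np

# Toy illustration of regressional Goodhart under a Gaussian model,
# and the Bayes-estimate fix. Parameters are arbitrary assumptions.
rng = np.random.default_rng(0)

sigma_v, sigma_e = 1.0, 1.0  # std of the true value V and of the proxy noise
n = 1_000_000

true_value = rng.normal(0.0, sigma_v, n)           # V ~ N(0, sigma_v^2)
proxy = true_value + rng.normal(0.0, sigma_e, n)   # X = V + noise

# Optimize the proxy: keep the top 1% by measured score.
selected = proxy >= np.quantile(proxy, 0.99)

# Naive expectation: the proxy score of what we selected.
naive = proxy[selected].mean()

# What we actually get: the true value of what we selected (the tails diverge).
actual = true_value[selected].mean()

# Bayes estimate E[V | X] = X * sigma_v^2 / (sigma_v^2 + sigma_e^2),
# i.e. shrinkage toward the prior mean.
shrink = sigma_v**2 / (sigma_v**2 + sigma_e**2)
bayes = (proxy[selected] * shrink).mean()

print(f"mean proxy of selected:      {naive:.3f}")   # inflated by selection
print(f"mean true value of selected: {actual:.3f}")  # regressional Goodhart
print(f"mean Bayes estimate:         {bayes:.3f}")   # tracks the true value
```

The point of the sketch is only that the shrinkage correction requires knowing the prior over the true value and the noise model—which is exactly the “goal is well-defined” condition, and exactly what the subsystem lacks when we can’t pass it the full objective.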