I missed the proposal when it was first released, but I wanted to note that the original proposal addresses only one (critical) class of Goodhart error, and proposes a strategy based on addressing one problematic result of that class, nearest-unblocked neighbor. The strategy is more widely useful for misspecification than just nearest-unblocked neighbor, but it still addresses only some Goodhart effects.
The misspecification discussed is more closely related to, but still distinct from, extremal and regressional Goodhart. (Causal and adversarial Goodhart are somewhat far removed, and don’t seem as relevant to me here. Causal Goodhart is due to mistakes, albeit mistakes that are fundamentally hard to avoid, while adversarial Goodhart happens via exploiting other modes of failure.)
I notice I am confused about how different strategies being proposed to mitigate these related failures can coexist if each is implemented separately, and/or how they would be balanced if implemented together, as I briefly outline below. Reconciling or balancing these different strategies seems like an important question, but I want to wait to see the full research agenda before commenting or questioning further.
Explaining the conflict I see between the strategies:
Extremal Goodhart is somewhat addressed by another post you made, which proposes to avoid ambiguous distant situations—https://www.lesswrong.com/posts/PX8BB7Rqw7HedrSJd/by-default-avoid-ambiguous-distant-situations. It seems that the strategy proposed here is to attempt to resolve fuzziness, rather than avoid areas where it becomes critical. These seem to be at least somewhat at odds, though they are partly reconcilable by pursuing neither fully: neither fully resolving ambiguity, nor fully avoiding distant ambiguity.
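To illustrate what I mean by pursuing neither fully, here is a purely hypothetical sketch (my own, with made-up thresholds and distance/ambiguity measures, not anything from either post) of how avoidance and resolution might be balanced rather than chosen between:

```python
# Hypothetical sketch only: balance "avoid ambiguous distant situations" against
# "resolve fuzziness" instead of committing fully to either. The thresholds and
# the distance/ambiguity measures are placeholders, not from either post.
def choose_action_mode(distance, ambiguity, d_max=1.0, a_max=0.5):
    """distance: how far the situation is from the in-sample region;
    ambiguity: how uncertain the (modified) utility is about this situation."""
    if distance > d_max and ambiguity > a_max:
        return "avoid"   # too distant and too fuzzy: fall back to a conservative default
    if ambiguity > a_max:
        return "ask"     # near-distribution but fuzzy: try to resolve it externally
    return "act"         # in-sample and unambiguous: proceed

print(choose_action_mode(distance=0.2, ambiguity=0.1))  # act
print(choose_action_mode(distance=0.3, ambiguity=0.8))  # ask
print(choose_action_mode(distance=2.0, ambiguity=0.9))  # avoid
```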
Regressional Goodhart, as Scott G. originally pointed out, is unavoidable except by staying in-sample, interpolating rather than extrapolating. Fully pursuing that strategy is precluded by injecting uncertainty into the model of the human-provided modification to the utility function. Again, this is partly reconcilable, for example, by trying to bound how far we let the system stray from the initially provided blocked strategy, and how much fuzziness it is allowed to infer without an external check.
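To make the regressional point concrete, here is a minimal toy simulation (my own sketch, not from either post) of the optimizer's curse: selecting on a noisy proxy systematically overestimates true value, and the overestimate grows as the search ranges over more options, which is one way of seeing why staying in-sample limits the damage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: true value V ~ N(0,1), proxy U = V + noise. Picking the option with
# the highest proxy value overestimates its true value, and the gap grows with
# the number of options searched over (the optimizer's curse / regressional Goodhart).
def proxy_overestimate(n_options, noise_sd=1.0, n_trials=10_000):
    true_value = rng.normal(size=(n_trials, n_options))
    proxy = true_value + rng.normal(scale=noise_sd, size=(n_trials, n_options))
    picked = np.argmax(proxy, axis=1)
    rows = np.arange(n_trials)
    return (proxy[rows, picked] - true_value[rows, picked]).mean()

for n in (2, 10, 100, 1000):
    print(f"options searched: {n:5d}  mean overestimate: {proxy_overestimate(n):.3f}")
# The overestimate grows with the size of the search space; bounding how far the
# system is allowed to stray bounds this error, at the cost of forgoing options.
```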