I tentatively think it makes the most sense to apply Goodhart exclusively to selection processes rather than control, except perhaps for the causal case.
All flavors of Goodhart require a system to be clearly optimizing a proxy. Selection-style optimization gets direct feedback of some kind; this feedback can be “the true function” or “the proxy”—although calling it either requires anthropomorphic analysis of what the system is trying to do—perhaps via observation of how/why the system was set up. So we can talk about Goodhart by comparing what the selection process explicitly optimizes to our anthropomorphic analysis of what it “tries to” optimize.
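As a toy illustration of that framing (an invented example, not something from the discussion above), here is a minimal Python sketch of a selection process that explicitly optimizes a proxy which is only correlated with the value we would anthropomorphically say it is “trying” to optimize; hard selection on the proxy drags the winner away from the true optimum. All the functions and numbers are made up for illustration.

```python
import random

# Toy sketch (illustrative only): a selection process that explicitly
# optimizes a proxy which is correlated with, but not identical to, the
# "true" value we would say it was set up to optimize.

def true_value(x):
    return -(x - 1.0) ** 2           # what we "really" wanted maximized (peak at x = 1)

def proxy(x):
    return true_value(x) + 2.0 * x   # correlated with the true value, but biased (peak at x = 2)

candidates = [random.uniform(-5.0, 5.0) for _ in range(10_000)]
winner = max(candidates, key=proxy)  # the selection process only ever sees the proxy

print(f"selected x = {winner:.2f}")
print(f"proxy score = {proxy(winner):.2f}, true value = {true_value(winner):.2f}")
# Hard selection on the proxy reliably lands near x = 2, where the true value
# is clearly worse than at the true optimum x = 1: the basic Goodhart pattern.
```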
A control system, on the other hand, is all anthropomorphism. We look at it efficiently steering the world into a narrow space of possibility, and we conclude that’s what it is “trying” to do. So where is the true-vs-proxy to compare?
It might be that we have two or more plausible ways to ascribe goals to the controller, and that these conflict. For example, maybe the controller happens to have an explicit specification of a utility function somewhere in its machine code, while on the other hand we know that it was built for a specific purpose—and the two differ from each other. This is simple value misalignment (arguably outright misspecification, not Goodhart?).
However, I can’t think of a reason why the thing would need an explicit utility function inside of it unless it was internally implementing a selection process, like a planning algorithm or something. So that brings us back to applying the Goodhart concept to selection processes, rather than control.
You mention model errors. If the controller is internally using model-based reasoning, it seems very likely that it is doing selection-style planning. So again, the Goodhart concept seems to apply to the selection part.
There are some situations where we don’t need to apply any anthropomorphic analysis to infer a goal for a controller, because it is definitely responsive to a specific form of feedback: namely, reinforcement learning. In the reinforcement-learning case, a “Goodhart-ish” failure which can occur is wireheading. It’s interesting that I have never been quite comfortable classifying wireheading as one of the four types of Goodhart; perhaps that’s because I was applying the selection-vs-control distinction implicitly.
I mentioned that causal Goodhart might be the exception. It seems to me that causal failure applies to selection processes too, but ONLY to selection which is being used to implement a controller. In effect, it’s a type of model error for a model-based controller.
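As a toy sketch of that “model error” reading (again an invented example): a model-based controller plans by selecting whichever action its model predicts is best, but the model mistakes a correlated indicator for a cause, so the chosen intervention does nothing in the real environment. The action names and payoffs below are hypothetical.

```python
# Toy sketch (illustrative only) of causal Goodhart as a model error inside a
# model-based controller: the planner's model treats a correlated indicator as
# if forcing it would cause the outcome the controller cares about.

MODEL_PREDICTION = {    # what the learned model expects from each action
    "do_nothing": 0.0,
    "improve_root_cause": 0.7,
    "force_indicator": 0.9,   # the indicator correlated with good outcomes historically
}

REAL_OUTCOME = {        # what actually happens in the environment
    "do_nothing": 0.0,
    "improve_root_cause": 0.7,
    "force_indicator": 0.0,   # forcing the indicator breaks its link to the outcome
}

# The "selection part" inside the controller: pick the action the model rates best.
plan = max(MODEL_PREDICTION, key=MODEL_PREDICTION.get)

print(f"planned action: {plan}")
print(f"model predicts {MODEL_PREDICTION[plan]:.1f}, reality gives {REAL_OUTCOME[plan]:.1f}")
```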
All of this is vague and fuzzy and only weakly endorsed.
Thanks for the feedback. I agree that in a control system, any divergence between intent and outcome is an alignment issue, and I agree that this makes overoptimization different in control versus selection. Despite the conceptual confusion, I definitely think the connections are worth noting—not only “wireheading,” but the issues with mesa-optimizers. And I definitely think that causal failures are particularly important in this context.
But I strongly endorse how weak and fuzzy this is—which is a large part of why I wanted to try to de-confuse myself. That’s the goal of this mini-sequence, and I hope that doing so publicly in this way at least highlights where the confusion is, even if I can’t successfully de-confuse myself, much less others. And if there are places where others are materially less confused than me and/or you, I’d love for them to write responses or their own explainers on this.
I think I already want to back off on my assertion that the categories should not be applied to controllers. However, I see the application to controllers as more complex. It’s clearer what it means to (successfully) point a selection-style optimization process at a proxy: in a selection setting, you have the proxy (which the system can access) and the true value (which is not accessible). Wireheading only makes sense when the “true” value is partially accessible, and the agent severs that connection.
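As a toy sketch of that last point (another invented example): the agent’s only access to the “true” value is a measurement channel, and one available action tampers with that channel, so a controller that greedily maximizes the measurement prefers to sever the connection. The actions and numbers are hypothetical.

```python
# Toy sketch (illustrative only) of wireheading: the "true" value is only
# partially accessible, via a measurement channel that one action can sever.

TRUE_VALUE = {"work": 1.0, "idle": 0.0, "tamper_with_sensor": -0.5}

def measured_value(action):
    if action == "tamper_with_sensor":
        return 100.0           # a severed channel reports whatever the agent likes
    return TRUE_VALUE[action]  # an intact channel (imperfectly) tracks the true value

# The controller only ever optimizes the measurement it can see.
chosen = max(TRUE_VALUE, key=measured_value)

print(f"chosen action: {chosen}")
print(f"measured: {measured_value(chosen):.1f}, true: {TRUE_VALUE[chosen]:.1f}")
```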
I definitely appreciate your posts on this; it hadn’t occurred to me to ask whether the four types apply equally well to selection and control.