I think it’s very unclear how big a problem Goodhart is for alignment research; it seems like an empirical question about the particular technical domain.
Just a couple weeks ago I had this post talking about how, in some technical areas, we’ve been able to find very robust formulations of particular concepts (i.e. “True Names”). The domains where evaluation is much easier—math, physics, CS—are the domains where we have those robust formulations. Even within e.g. physics, evaluation stops being easy when we’re in a domain where we don’t have a robust mathematical formulation of the phenomena of interest.
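To give one concrete example of the kind of robust formulation I mean (a standard one, not necessarily the post’s): information theory pins down “information” as mutual information,

$$I(X;Y) \;=\; \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)},$$

and a one-time pad is provably secure because $I(\text{plaintext};\text{ciphertext}) = 0$. That guarantee keeps holding no matter how hard an adversary optimizes against it, which is exactly the sense of robustness that makes evaluation easy in those domains.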
The other point of that post is that we do not currently have such formulations for the phenomena of interest in alignment, and, on one framing, the point of foundational agency research is to find them.
So I agree that the difficulty of evaluation varies by domain, but I don’t think it’s some mysterious hard-to-predict thing. The places where robust evaluation is easy all build on qualitatively-similar foundational pieces, and alignment does not yet have those sorts of building blocks.
The kind of Goodhart we are usually concerned about is stuff like “it’s easier to hijack the reward signal than to actually perform a challenging task,” and I don’t think that problem is very tightly correlated with whether a domain has robust formulations of its key concepts. So the rhetoric here feels like it involves a bit of an equivocation between two different senses of “Goodhart.”
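To make that failure mode concrete, here is a minimal toy sketch (hypothetical numbers and action names, purely illustrative):

```python
import random

# Toy Goodhart/reward-hacking sketch: the agent picks whichever action
# maximizes its *measured* reward. "tamper" corrupts the reward channel,
# so the measured signal and the true task value come apart.

def true_value(action: str) -> float:
    """What we actually wanted: only real work counts."""
    return 1.0 if action == "do_task" else 0.0

def measured_reward(action: str) -> float:
    """What the agent optimizes: the sensor reading."""
    if action == "tamper":
        return 10.0  # hijacked sensor pins the reading at maximum
    if action == "do_task":
        return 1.0 - 0.1 * random.random()  # noisy but honest signal
    return 0.0

actions = ["do_task", "tamper", "idle"]
chosen = max(actions, key=measured_reward)

print(chosen)                             # -> tamper
print(round(measured_reward(chosen), 2))  # -> 10.0 (looks great)
print(round(true_value(chosen), 2))       # -> 0.0 (nothing got done)
```

The point is just that optimizing the measured signal can be strictly easier than doing the task, whether or not we have a clean mathematical formulation of the task itself.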
Go take a look at that other post; it has two good examples of how Goodhart shows up as a central barrier to alignment.
I don’t buy the empirical claim about when recognition is easier than generation. For example, in math, computer science, and physics, I think you can recognize robust formulations much more easily than you can generate them. In general, “recognition is not trivial” is different from “recognition is as hard as generation.”
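As a toy illustration of that gap (my stand-in example, using subset-sum rather than anything alignment-specific): checking a proposed certificate is a linear-time scan, while producing one may require searching exponentially many subsets.

```python
from collections import Counter
from itertools import combinations

def recognize(nums: list[int], subset: list[int], target: int) -> bool:
    """Verification: cheap linear check that `subset` is drawn from
    `nums` and sums to `target`."""
    return sum(subset) == target and not (Counter(subset) - Counter(nums))

def generate(nums: list[int], target: int):
    """Generation: brute-force search over all 2^n subsets."""
    for r in range(len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == target:
                return list(combo)
    return None  # no subset sums to target

nums = [3, 34, 4, 12, 5, 2]
cert = generate(nums, 9)               # expensive search
print(cert, recognize(nums, cert, 9))  # -> [4, 5] True (cheap check)
```

Recognition here is clearly nontrivial to get right, but it is still far cheaper than generation, which is the distinction I’m pointing at.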
I found this comment pretty convincing. Alignment has been compared to philosophy, which seems to sit at the opposite end of “the fuzziness spectrum” from math and physics. And it does seem like concept fuzziness would make evaluation harder.
I’ll note, though, that ARC’s approach to alignment seems more math-problem-flavored than yours, which might be a source of disagreement between you two (since you may each conceptualize what it means to work on alignment differently).