Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.
I think it’s very unclear how big a problem Goodhart is for alignment research—it seems like a question about a particular technical domain. There are domains where evaluation is much easier: most obviously mathematics, but also in e.g. physics or computer science, where there are massive gaps between recognition and generation even if you don’t have formal theorem statements. There are also domains where evaluation is not much easier, where the whole thing rests on complicated judgments and the search for clever arguments just isn’t doing much work.
It looks to me like alignment is somewhere in the middle, though it’s not at all clear—right now there are different strands of alignment progress, which seem to have very different properties with respect to the ease of evaluation.
The kind of Goodhart we are usually concerned about is stuff like “it’s easier to hijack the reward signal than to actually perform a challenging task,” and I don’t think that’s very tightly correlated with how hard it is to evaluate alignment research. So this feels like the rhetoric here involves a bit of an equivocation.
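To make the kind of failure being pointed at here concrete, here is a minimal toy sketch in Python; the budget split and the specific proxy and true-objective functions are made up for illustration rather than taken from anything above.

```python
# Toy Goodhart / reward-hacking sketch (illustrative numbers only).
# An optimizer splits a fixed budget between genuine effort and
# "hacking" the reward signal, and maximizes the proxy reward.

def true_objective(effort: float, hacking: float) -> float:
    # What we actually care about: real performance, harmed by hacking.
    return effort - 2.0 * hacking

def proxy_reward(effort: float, hacking: float) -> float:
    # What the reward signal measures: hacking inflates it more cheaply
    # than genuine effort does.
    return effort + 3.0 * hacking

def proxy_optimal_split(budget: float = 10.0, step: float = 0.5):
    # Greedy search over budget splits, maximizing the proxy reward.
    candidates = []
    hacking = 0.0
    while hacking <= budget:
        effort = budget - hacking
        candidates.append((proxy_reward(effort, hacking), effort, hacking))
        hacking += step
    return max(candidates)

proxy, effort, hacking = proxy_optimal_split()
print(f"effort={effort:.1f}, hacking={hacking:.1f}")
print(f"proxy={proxy:.1f}, true objective={true_objective(effort, hacking):.1f}")
```

The proxy-optimal split puts the entire budget into hacking the signal, so the proxy score is maximized while the true objective goes strongly negative, which is exactly the “hijack the reward signal rather than perform the task” failure.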
I think it’s very unclear how big a problem Goodhart is for alignment research—it seems like a question about a particular technical domain.
Just a couple weeks ago I had this post talking about how, in some technical areas, we’ve been able to find very robust formulations of particular concepts (i.e. “True Names”). The domains where evaluation is much easier—math, physics, CS—are the domains where we have those robust formulations. Even within e.g. physics, evaluation stops being easy when we’re in a domain where we don’t have a robust mathematical formulation of the phenomena of interest.
The other point of that post is that we do not currently have such formulations for the phenomena of interest in alignment, and (on one framing) the point of foundational agency research is to find them.
So I agree that the difficulty of evaluation varies by domain, but I don’t think it’s some mysterious hard-to-predict thing. The places where robust evaluation is easy all build on qualitatively-similar foundational pieces, and alignment does not yet have those sorts of building blocks.
The kind of Goodhart we are usually concerned about is stuff like “it’s easier to hijack the reward signal than to actually perform a challenging task,” and I don’t think that’s very tightly correlated with how hard it is to evaluate alignment research. So this feels like the rhetoric here involves a bit of an equivocation.
Go take a look at that other post; it has two good examples of how Goodhart shows up as a central barrier to alignment.
I don’t buy the empirical claim about when recognition is easier than generation. As an example, I think you can recognize robust formulations much more easily than you can generate them in math, computer science, and physics. In general, I think “recognition is not trivial” is different from “recognition is as hard as generation.”
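For what it’s worth, a standard concrete instance of that recognition/generation gap in computer science (not an example from either post, just a minimal sketch of the asymmetry) is subset-sum: checking a proposed solution takes one pass over it, while finding a solution in general means searching through exponentially many subsets.

```python
from itertools import combinations

def verify(certificate, numbers, target):
    # Recognition: checking a proposed solution is cheap (one pass;
    # multiplicity of elements is ignored for simplicity).
    return all(x in numbers for x in certificate) and sum(certificate) == target

def generate(numbers, target):
    # Generation: brute-force search over subsets, exponential in len(numbers).
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return list(subset)
    return None

nums = [3, 34, 4, 12, 5, 2]
print(verify([4, 5], nums, 9))  # True, checked instantly
print(generate(nums, 9))        # [4, 5], found only after searching subsets
```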
I found this comment pretty convincing. Alignment has been compared to philosophy, which seems to sit at the opposite end of “the fuzziness spectrum” from math and physics. And it does seem like concept fuzziness would make evaluation harder.
I’ll note, though, that ARC’s approach to alignment seems more math-problem-flavored than yours, which might be a source of disagreement between you two (since you may conceptualize what it means to work on alignment differently).