Drake Thomas comments on When is Goodhart catastrophic?

Drake Thomas 9 May 2023 23:30 UTC
LW: 4 AF: 2
I’m not sure what you mean formally by these assumptions, but I don’t think we’re making all of them. Certainly we aren’t assuming things are normally distributed—the post is in large part about how things change when we stop assuming normality! I also don’t think we’re making any assumptions with respect to additivity; $X = U - V$ is more of a notational or definitional choice, though as we’ve noted in the post it’s a framing that one could think doesn’t carve reality at the joints. (Perhaps you meant something different by additivity, though—feel free to clarify if I’ve misunderstood.)
Independence is absolutely a strong assumption here, and I’m interested in further explorations of how things play out in different non-independent regimes—in particular we’d be excited about theorems that could classify these dynamics under a moderately large space of non-independent distributions. But I do expect that there are pretty similar-looking results where the independence assumption is substantially relaxed. If that’s false, that would be interesting!

Late edit: Just a note that Thomas has now published a new post in the sequence addressing things from a non-independence POV.
- rotatingpaguro 10 May 2023 21:02 UTC
  LW: 3 AF: 2
  I wasn’t saying you made all those assumptions; I was trying to imagine an empirical scenario that would yield your assumptions, and the first thing that came to mind produced even stricter ones.
  I do realize now that I messed up my comment when I wrote:
  > in practice reduces just to the part “have a story for why inductive bias and/or non-independence work in your favor”, because I currently think Normality + additivity + independence are bad assumptions, and I see that as almost a null advice.
  Normality should not be in that list, just additivity and independence, in the sense of $U - V ⊥ V$. Sorry.
  > But I do expect that there are pretty similar-looking results where the independence assumption is substantially relaxed.
  I do agree you could probably obtain similar-looking results with relaxed versions of the assumptions.
  However, $U - V ⊥ V$ seems quite specific to me, and to make your theorem look useful you would need to make a convincing case that it holds in some realistic cases. I expect the same objection will apply to whatever relaxed condition you find that still lets you prove a theorem.
  Example: if you said “I made a version of the theorem assuming there exists $f$ such that $f(U, V) ⊥ V$ for $f$ in some class of functions”, I’d still ask “and in what realistic situations does such a setup arise, and why?”
  - Thomas Kwa 10 May 2023 22:14 UTC
    4 points
    In my frame, $U$ is not just some variable correlated with $V$, it’s some estimator’s best estimate, and so it makes sense that residuals $X = U - V$ would have various properties, for the same reason we consider residuals in statistics, returns in finance, etc.
    The basic idea why we might get $U - V ⊥ V$ is that there are some properties that increase the overseer’s rating and actually make the plan good (say, the plan includes a solution to the shutdown problem, interpretability, or whatever) and different properties that increase the overseer’s rating for no good reason (e.g. the plan uses really sophisticated words and an optimistic tone). I think assuming these are independent and additive is reasonable as a toy model, though as we said they’re probably violated in real life and we’re interested in weakening these assumptions.
    I guess you could get an elliptical distribution through something like this: all properties contribute to both $X$ and $V$ to some degree, and the distribution of the angle is roughly uniform while the magnitudes are heavy-tailed. I’m not sure whether this is as natural as independence: if some property of the AI’s output makes the human irrationally approve of it (high $X$), then it seems likely to be optimized for that, rather than also having huge impacts on $V$ one way or the other.
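The elliptical construction described above can be sketched as follows. This is only an illustration, not something taken from the post: the Pareto radius with shape `alpha=2.0`, the threshold `c=10.0`, and the function names `sample_elliptical` and `extreme_v_rates` are all assumptions chosen for the example. It draws a uniform angle and a heavy-tailed magnitude, then compares how often $|V|$ is extreme overall with how often it is extreme given that $|X|$ is extreme.

```python
import math
import random

def sample_elliptical(n, alpha=2.0, seed=0):
    """Draw (X, V) = R * (cos(theta), sin(theta)) with a uniform angle
    and a heavy-tailed Pareto radius (an illustrative choice of tail)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        r = rng.paretovariate(alpha)  # P(R > r) = r**(-alpha) for r >= 1
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

def extreme_v_rates(points, c=10.0):
    """Return sample frequencies (P(|V| > c), P(|V| > c given |X| > c))."""
    n = len(points)
    v_big = sum(1 for _, v in points if abs(v) > c)
    x_big = [(x, v) for x, v in points if abs(x) > c]
    both = sum(1 for _, v in x_big if abs(v) > c)
    conditional = both / len(x_big) if x_big else float("nan")
    return v_big / n, conditional

if __name__ == "__main__":
    base, cond = extreme_v_rates(sample_elliptical(200_000))
    print(f"P(|V| > 10)            ~ {base:.4f}")
    print(f"P(|V| > 10 | |X| > 10) ~ {cond:.4f}")
```

Under these assumed parameters the conditional frequency should come out far above the unconditional one: a huge $X$ usually arrives together with a huge $|V|$, of either sign. Under independence the two frequencies would match, which is the contrast being drawn between the elliptical picture and the independent one.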
    - rotatingpaguro 10 May 2023 23:50 UTC
      1 point
      > if some property of the AI’s output makes the human irrationally approve of it (high $X$), then it seems likely to be optimized for that, rather than also having huge impacts on $V$ one way or the other.
      Are you saying that your (rough, preliminary) justification for independence is that it’s what gets you Goodhart, so you use it? Isn’t this circular? Ok so maybe I misinterpreted your intentions: I thought you wanted to “prove” that Goodhart happens, while possibly you wanted to “show an example” of Goodhart happening?
      - Thomas Kwa 11 May 2023 1:22 UTC
        4 points
        It doesn’t look circular to me? I’m not assuming that we get Goodhart, just that properties that result in very high X seem like they would be things like “very rhetorically persuasive” or “tricks the human into typing a very large number into the rating box” that won’t affect V much, rather than properties with very high magnitude towards both X and V. I believe this less for V, so we’ll probably have to replace independence with this.
        I think you’re splitting hairs. We prove Goodhart follows from certain assumptions, and I’ve given some justification for the assumptions as well as their limitations, so you could equally say that we “prove” or “show an example”. If by circular you mean we proved something about independent X and V because this was easier than more realistic assumptions, we’re guilty! The proof was a huge pain and we wanted to publish rather than overcomplicating it more, partly to get feedback like yours. But I do have some intuition that the result is useful, partly because things are sometimes approximately independent, and partly because the basic reasons behind the proof extend to other cases.
  - Drake Thomas 11 May 2023 0:12 UTC
    LW: 2 AF: 2
    An example of the sort of strengthening I wouldn’t be surprised to see is something like “If $V$ is not too badly behaved in the following ways, and for all $v \in \mathbb{R}$ we have [some light-tailedness condition] on the conditional distribution $(X | V = v)$, then catastrophic Goodhart doesn’t happen.” This seems relaxed enough that you could actually encounter it in practice.
    - Thomas Kwa 15 Nov 2023 21:12 UTC
      LW: 3 AF: 2
      Suppose that we are selecting for $U = X + V$ where $V$ is true utility and $X$ is error. If our estimator is unbiased ($E[X | V = v] = 0$ for all $v$) and $X$ is light-tailed conditional on any value of $V$, do we have $\lim_{t \to \infty} E[V | X + V \geq t] = \infty$?
      No; here is a counterexample. Suppose that $V \sim N(0, 1)$, and $X | V \sim N(0, 4)$ when $V \in [-1, 1]$, otherwise $X = 0$. Then I think $\lim_{t \to \infty} E[V | X + V \geq t] = 0$.
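A rough way to see that the conditional mean stays bounded in this example, offered as a sketch rather than a proof. It reads $N(0, 4)$ as variance 4, and the symbols $A = \{|V| \le 1\}$, $\varphi$ (standard normal density) and $\bar\Phi$ (standard normal upper tail) are notation introduced here, not in the original comment. For $t > 1$, the event $U \ge t$ splits into a part inside $A$ and the part $\{V \ge t\}$ outside it, where $X = 0$:

$$
E[V \mid U \ge t]
= \frac{E[V \mathbf{1}_A \mathbf{1}_{U \ge t}] + E[V \mathbf{1}_{V \ge t}]}{P(A,\ U \ge t) + P(V \ge t)},
\qquad E[V \mathbf{1}_{V \ge t}] = \varphi(t).
$$

Inside $A$ the error has standard deviation 2 and $|V| \le 1$, so

$$
P(A,\ U \ge t) \ \ge\ P(0 \le V \le 1)\,\bar\Phi(t/2),
\qquad
\frac{\varphi(t)}{\bar\Phi(t/2)} \to 0 \quad (t \to \infty),
$$

because $\varphi(t)$ decays like $e^{-t^2/2}$ while $\bar\Phi(t/2)$ decays only like $e^{-t^2/8}$. Using $|V \mathbf{1}_A| \le 1$ in the numerator gives

$$
E[V \mid U \ge t] \ \le\ 1 + \frac{\varphi(t)}{P(A,\ U \ge t)}
\quad\Longrightarrow\quad
\limsup_{t \to \infty} E[V \mid U \ge t] \le 1 .
$$

So the selected $V$ cannot diverge, which is all the counterexample needs. The bound does not pin down the exact limit; since the weight $P(X \ge t - v)$ increases in $v$, this heuristic suggests the selected values lean toward the upper end of $[-1, 1]$, so the limit may sit closer to 1 than to 0.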
      This is worrying because in the case where $V \sim N(0, 1)$ and $X \sim N(0, 4)$ independently, the limit is infinite. Merely making the error *smaller* for large values of V causes catastrophe. This suggests that success caused by light-tailed error, when V has even lighter tails than X, is fragile, and that these successes are “for the wrong reason”: they require the error to overestimate the value as much when V is high as when V is low.
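The contrast between the two cases can be checked numerically with a small Monte Carlo sketch. Several things here are assumptions rather than part of the original comment: $N(0, 4)$ is read as variance 4 (standard deviation 2), the function names are hypothetical, and the sample size and thresholds are arbitrary. A simulation only reaches moderate $t$, so it illustrates the trend and says nothing definite about the limit.

```python
import random

def conditional_mean_v(sample_x_given_v, t, n=400_000, seed=0):
    """Monte Carlo estimate of E[V | X + V >= t] with V ~ N(0, 1)."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n):
        v = rng.gauss(0.0, 1.0)
        x = sample_x_given_v(rng, v)
        if x + v >= t:
            total += v
            count += 1
    return total / count if count else float("nan")

def independent_error(rng, v):
    # X ~ N(0, 4) regardless of V (variance 4, so standard deviation 2).
    return rng.gauss(0.0, 2.0)

def gated_error(rng, v):
    # Counterexample: X | V ~ N(0, 4) when V is in [-1, 1], otherwise X = 0.
    return rng.gauss(0.0, 2.0) if -1.0 <= v <= 1.0 else 0.0

if __name__ == "__main__":
    for t in (3.0, 5.0):
        indep = conditional_mean_v(independent_error, t)
        gated = conditional_mean_v(gated_error, t)
        print(f"t={t}: independent ~ {indep:.2f}, gated ~ {gated:.2f}")
```

In the independent Gaussian case, $E[V | U] = U/5$ under the variance reading, so the estimate should keep growing with $t$. In the gated case almost every selected sample has $V \in [-1, 1]$, so the estimate should stay below 1 however large $t$ gets.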