I wasn’t saying you made all those assumptions; I was trying to imagine an empirical scenario that would produce your assumptions, and the first one that came to mind produced even stricter ones.
I do realize now that I messed up my comment when I wrote
> in practice reduces just to the part “have a story for why inductive bias and/or non-independence work in your favor”, because I currently think Normality + additivity + independence are bad assumptions, and I see that as almost null advice.
There should not have been Normality there, just additivity and independence, in the sense of U−V⊥V. Sorry.
> But I do expect that there are pretty similar-looking results where the independence assumption is substantially relaxed.
I do agree you could probably obtain similar-looking results with relaxed versions of the assumptions.
However, just as U−V⊥V seems quite specific to me, so that you would need to make a convincing case that it actually arises in realistic situations for the theorem to look useful, I expect the same to apply to whatever relaxed condition you can find that still allows you to prove a theorem.
Example: if you said “I made a version of the theorem assuming there exists f, in some class of functions, such that f(U,V)⊥V”, I’d still ask “and in what realistic situations does such a setup arise, and why?”
In my frame, U is not just some variable correlated with V: it’s some estimator’s best estimate of V, and so it makes sense that the residuals X=U−V would have various properties, for the same reason we consider residuals in statistics, returns in finance, etc.
The basic idea behind why we might get U−V⊥V is that some properties increase the overseer’s rating and actually make the plan good (say, the plan includes a solution to the shutdown problem, interpretability, or whatever), while other properties increase the overseer’s rating for no good reason (e.g. the plan uses really sophisticated words and an optimistic tone). I think assuming these two contributions are independent and additive is reasonable as a toy model, though as we said they’re probably violated in real life and we’re interested in weakening these assumptions; a minimal simulation sketch of this toy model follows below.
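As a sanity check of the toy model, here is a minimal simulation sketch (not from the original discussion; the specific error distributions are my own illustrative assumptions). Selecting hard on U = V + X, the conditional mean of V keeps growing when the independent error X is light-tailed, but stalls near zero when X is heavy-tailed, which is the catastrophic-Goodhart pattern at issue:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4_000_000

# Toy model: V collects the properties that genuinely make the plan good,
# X collects the properties that inflate the overseer's rating for no good
# reason, and the proxy is U = V + X with X drawn independently of V.
V = rng.normal(0.0, 1.0, n)
X_light = rng.normal(0.0, 1.0, n)        # light-tailed error
X_heavy = rng.standard_t(df=2, size=n)   # heavy-tailed error (Student t, 2 dof)

for name, X in [("light-tailed X", X_light), ("heavy-tailed X", X_heavy)]:
    U = V + X
    for t in [2.0, 4.0, 6.0]:
        sel = U >= t
        if sel.any():
            print(f"{name}: E[V | U >= {t}] ~= {V[sel].mean():+.2f} (n = {sel.sum()})")
```

With the light-tailed error the conditional mean rises with the threshold t, while with the heavy-tailed error it stays near zero: almost all of the selection pressure goes into the error term.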
I guess you could get an elliptical distribution through something like this: all properties contribute to both X and V to some degree, the distribution of the angle is roughly uniform, and the magnitudes are heavy-tailed. I’m not sure whether this is as natural as independence: if some property of the AI’s output makes the human irrationally approve of it (high X), then it seems likely to be optimized for that, rather than also having huge impacts on V one way or the other.
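For concreteness, here is one way to sample from that angle/magnitude construction (the uniform angle and the Pareto tail index are my illustrative choices, not anything from the thread). The two coordinates share a heavy-tailed magnitude, so they are far from independent:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Angle/magnitude construction: every property contributes to both X and V,
# the direction of the contribution is roughly uniform, and the shared
# magnitude is heavy-tailed.
theta = rng.uniform(0.0, 2.0 * np.pi, n)   # roughly uniform direction
r = rng.pareto(2.0, n) + 1.0               # heavy-tailed magnitude (Pareto, index 2)
X = r * np.cos(theta)
V = r * np.sin(theta)

U = X + V
for t in [5.0, 20.0, 80.0]:
    sel = U >= t
    if sel.any():
        print(f"E[V | U >= {t}] ~= {V[sel].mean():.2f} (n = {sel.sum()})")
```

In this rotationally symmetric alternative, extreme U comes from large shared magnitudes at angles where X and V are both large, so the conditional mean of V grows with t rather than collapsing: a qualitatively different regime from the independent heavy-tailed case.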
> if some property of the AI’s output makes the human irrationally approve of it (high X), then it seems likely to be optimized for that, rather than also having huge impacts on V one way or the other.
Are you saying that your (rough, preliminary) justification for independence is that it’s what gets you Goodhart, so you use it? Isn’t this circular? Ok so maybe I misinterpreted your intentions: I thought you wanted to “prove” that Goodhart happens, while possibly you wanted to “show an example” of Goodhart happening?
It doesn’t look circular to me? I’m not assuming that we get Goodhart, just that the properties that result in very high X seem like they would be things like “very rhetorically persuasive” or “tricks the human into typing a very large number into the rating box”, which won’t affect V much, rather than properties with very high-magnitude effects on both X and V. I believe the analogous claim less in the V direction, so we’ll probably have to replace independence with this one-sided condition.
I think you’re splitting hairs. We prove Goodhart follows from certain assumptions, and I’ve given some justification for the assumptions as well as their limitations, so you could equally say that we “prove” it or “show an example”. If by circular you mean we proved something about independent X and V because this was easier than more realistic assumptions, we’re guilty! The proof was a huge pain and we wanted to publish rather than complicate it further, partly to get feedback like yours. But I do have some intuition that the result is useful, partly because things are sometimes approximately independent, and partly because the basic reasons behind the proof extend to other cases.
An example of the sort of strengthening I wouldn’t be surprised to see is something like “If V is not too badly behaved in the following ways, and for all v∈R we have [some light-tailedness condition] on the conditional distribution (X|V=v), then catastrophic Goodhart doesn’t happen.” This seems relaxed enough that you could actually encounter it in practice.
Suppose that we are selecting for U=X+V, where V is true utility and X is error. If our estimator is unbiased (E[X|V=v]=0 for all v) and X is light-tailed conditional on every value of V, do we have lim_{t→∞}E[V|X+V≥t]=∞?
No; here is a counterexample. Suppose that V∼N(0,1), and X|V∼N(0,4) when V∈[−1,1], otherwise X=0. For large t, the event X+V≥t is dominated by the region V∈[−1,1], where the error has the heavier tail; conditional on the event, V concentrates near the top of that interval, so lim_{t→∞}E[V|X+V≥t]=1 rather than ∞.
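A quick simulation sketch of this counterexample (reading N(0,4) as variance 4, i.e. standard deviation 2, which is my assumption) shows the conditional mean of V climbing toward 1 and then stopping, rather than diverging:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000_000

# Counterexample: V ~ N(0,1); X|V ~ N(0,4) when V is in [-1,1], else X = 0.
V = rng.normal(0.0, 1.0, n)
X = np.where(np.abs(V) <= 1.0, rng.normal(0.0, 2.0, n), 0.0)  # sd 2 = variance 4
U = X + V

for t in [2.0, 4.0, 6.0, 8.0]:
    sel = U >= t
    if sel.any():
        print(f"E[V | X+V >= {t}] ~= {V[sel].mean():.3f} (n = {sel.sum()})")
```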
This is worrying because in the case where V∼N(0,1) and X∼N(0,4) independently, we do get lim_{t→∞}E[V|X+V≥t]=∞. Merely making the error *smaller* for large values of V causes catastrophe. This suggests that success caused by light-tailed error when V has even lighter tails than X is fragile, and that these successes are “for the wrong reason”: they require the estimator to overestimate the value just as much when V is high as when V is low.
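For contrast, the divergence in the fully independent Gaussian case can be read off from the standard conditional-expectation formula for jointly Gaussian variables (a routine derivation, not from the original thread):

```latex
% V ~ N(0,1) and X ~ N(0,4) independent, so S := X + V ~ N(0,5),
% and (V, S) are jointly Gaussian with Cov(V, S) = Var(V) = 1.
\[
  \mathbb{E}[V \mid X+V = s]
    = \frac{\operatorname{Cov}(V,\,X+V)}{\operatorname{Var}(X+V)}\, s
    = \frac{s}{5},
  \qquad
  \mathbb{E}[V \mid X+V \ge t] \ge \frac{t}{5} \to \infty
  \quad\text{as } t \to \infty.
\]
```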