In my frame, U is not just some variable correlated with V; it’s some estimator’s best estimate, so it makes sense that the residuals X=U−V would have various properties, for the same reason we consider residuals in statistics, returns in finance, etc.
The basic reason we might get U−V⊥V is that some properties increase the overseer’s rating and actually make the plan good (say, the plan includes a solution to the shutdown problem, interpretability, or whatever), while different properties increase the overseer’s rating for no good reason (e.g. the plan uses really sophisticated words and an optimistic tone). I think assuming these are independent and additive is reasonable as a toy model, though as we said these assumptions are probably violated in real life and we’re interested in weakening them.
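A minimal sketch of this toy model, in case a simulation helps: V and X are drawn independently and added to give U, then we look at the mean of V among the samples with the highest U. The normal distribution for V, the two candidate distributions for X (normal, and Student's t with 2 degrees of freedom as a stand-in for "heavy-tailed"), the sample size, and the helper names are all assumptions made for illustration, not anything taken from the proof.

```python
import heapq
import random


def mean_v_in_top_u(sample_x, n=200_000, top_frac=0.001, seed=0):
    """Mean true value V among the samples with the highest proxy U = V + X."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        v = rng.gauss(0.0, 1.0)   # V: genuinely good properties (assumed normal)
        x = sample_x(rng)         # X: rating inflation, drawn independently of V
        pairs.append((v + x, v))  # additive toy model: U = V + X
    k = max(1, int(n * top_frac))
    top = heapq.nlargest(k, pairs)  # keep only the k highest-U samples
    return sum(v for _, v in top) / k


def light_tailed_x(rng):
    # Assumed light-tailed error: standard normal.
    return rng.gauss(0.0, 1.0)


def heavy_tailed_x(rng):
    # Assumed heavy-tailed error: Student's t with 2 degrees of freedom,
    # built as Z / sqrt(chi2_2 / 2), where chi2_2 / 2 is exponential with mean 1.
    return rng.gauss(0.0, 1.0) / (rng.expovariate(1.0) ** 0.5)


light = mean_v_in_top_u(light_tailed_x)
heavy = mean_v_in_top_u(heavy_tailed_x)
print(f"mean V in top 0.1% by U, light-tailed X: {light:.2f}")
print(f"mean V in top 0.1% by U, heavy-tailed X: {heavy:.2f}")
```

Under these particular choices, the first number should come out well above zero and the second close to zero: when X is heavy-tailed, the highest ratings are mostly explained by X alone. This only illustrates the independent, additive case and says nothing about what happens once those assumptions are weakened.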
I guess you could get an elliptical distribution through something like this: all properties contribute to both X and V to some degree, and the distribution of the angle is roughly uniform while the magnitudes are heavy-tailed. I’m not sure whether this is as natural as independence: if some property of the AI’s output makes the human irrationally approve of it (high X), then it seems likely to be optimized for that, rather than also having huge impacts on V one way or the other.
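For contrast, here is a rough sketch of the "uniform angle, heavy-tailed magnitude" construction in its simplest (spherical) form. The Pareto magnitude with shape 3, the sample size, and the function name are assumptions picked for illustration; a general elliptical distribution would also apply a linear map to (X, V).

```python
import heapq
import math
import random


def top_u_means_spherical(n=200_000, top_frac=0.001, alpha=3.0, seed=0):
    """Mean X and mean V among the highest-U samples of a spherical model."""
    rng = random.Random(seed)
    triples = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)  # angle: uniform
        r = rng.paretovariate(alpha)             # magnitude: heavy-tailed, r >= 1
        x, v = r * math.cos(theta), r * math.sin(theta)
        triples.append((x + v, x, v))            # proxy U = X + V
    k = max(1, int(n * top_frac))
    top = heapq.nlargest(k, triples)             # keep only the k highest-U samples
    mean_x = sum(t[1] for t in top) / k
    mean_v = sum(t[2] for t in top) / k
    return mean_x, mean_v


mean_x, mean_v = top_u_means_spherical()
print(f"top 0.1% by U: mean X = {mean_x:.2f}, mean V = {mean_v:.2f}")
```

In this model X and V are uncorrelated but not independent, and by symmetry a very high U splits roughly evenly between them, so selecting hard on U should pull V up along with X. That is the sense in which it behaves differently from the independent case, where a single property can push X far out without touching V.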
if some property of the AI’s output makes the human irrationally approve of it (high X), then it seems likely to be optimized for that, rather than also having huge impacts on V one way or the other.
Are you saying that your (rough, preliminary) justification for independence is that it’s what gets you Goodhart, so you use it? Isn’t this circular? Ok so maybe I misinterpreted your intentions: I thought you wanted to “prove” that Goodhart happens, while possibly you wanted to “show an example” of Goodhart happening?
It doesn’t look circular to me? I’m not assuming that we get Goodhart, just that properties that result in very high X seem like they would be things like “very rhetorically persuasive” or “tricks the human into typing a very large number into the rating box” that won’t affect V much, rather than properties with very high magnitude towards both X and V. I believe the analogous claim less for V, so we’ll probably have to replace independence with this weaker assumption.
I think you’re splitting hairs. We prove Goodhart follows from certain assumptions, and I’ve given some justification for the assumptions as well as their limitations, so you could equally say that we “prove” or “show an example”. If by circular you mean we proved something about independent X and V because this was easier than more realistic assumptions, we’re guilty! The proof was a huge pain and we wanted to publish rather than overcomplicating it more, partly to get feedback like yours. But I do have some intuition that the result is useful, partly because things are sometimes approximately independent, and partly because the basic reasons behind the proof extend to other cases.