So the reward model doesn’t need to be an exact, high-fidelity representation. An approximation is fine, and “a little off” is fine, but it needs to be approximately correct everywhere.
This is not quite true. If you select infinitely hard for high values of a proxy U = X + V, where V is the true utility and X is the error, you get infinite utility in expectation whenever utility is easier to optimize for than error (i.e. V has heavier tails than X). There are even cases where you get infinite utility despite the error having heavier tails than the utility, for example when error and true utility are independent and both are light-tailed.
Drake Thomas and I proved theorems about this here, and there might be another post coming soon about the nonindependent case.
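For intuition, here is a minimal Monte Carlo sketch of the heavy-tails claim. The specific distributions are my own hypothetical stand-ins (a Student-t for V, a Gaussian for X), not the ones from the theorems; the point is just that as selection on U gets harder, the conditional mean of V keeps growing while the conditional mean of X stays bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000

# Hypothetical stand-ins: true utility V heavy-tailed, error X light-tailed.
V = rng.standard_t(df=3, size=n)   # Student-t, heavy tails
X = rng.normal(size=n)             # Gaussian, light tails
U = V + X                          # the proxy we select on

# Select ever harder on U and look at what V and X do conditionally.
for q in (0.9, 0.99, 0.999):
    t = np.quantile(U, q)
    sel = U > t
    print(f"top {1 - q:.1%} of U: E[V | U > t] = {V[sel].mean():.2f}, "
          f"E[X | U > t] = {X[sel].mean():.2f}")
```

When V has the heavier tails, the extreme values of U are mostly produced by extreme V, so harder selection buys more true utility; swapping the distributions (heavy-tailed X, light-tailed V) flips which conditional mean saturates.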
I think I’m not super into the U = V + X framing; that seems to inherently suggest that there exists some component of the true utility V “inside” the proxy U everywhere, one which is merely perturbed by some error term rather than washed out entirely (in the manner I’d expect to see from an actual misspecification). In a lot of the classic Goodhart cases, the source of the divergence between measurement and desideratum isn’t regressional, and so V and X aren’t independent.
(Consider e.g. two arbitrary functions U’ and V’, and compute the “error term” X’ between them. It should be obvious that when U’ is maximized, X’ is much more likely to be large than V’ is; which is simply another way of saying that X’ isn’t independent of V’, since it was in fact computed from V’ (and U’). The claim that the reward model isn’t even “approximately correct”, then, is basically this: that there is a separate function U being optimized whose correlation with V within-distribution is in some sense coincidental, and that out-of-distribution the two become basically unrelated, rather than one being expressible as a function of the other plus some well-behaved error term.)
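A tiny numerical illustration of this point, using made-up functions of a shared input z (nothing here is specific to reward models): if you pick a bounded V′ and an unbounded U′ and define X′ by subtraction, then X′ is visibly correlated with V′, and conditional on U′ being large, X′ dominates.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=1_000_000)

# Two arbitrary functions of the same underlying variable; U' is not
# "V' plus independent noise" by construction.
V_p = np.tanh(z)          # stand-in "true utility", bounded
U_p = z + 0.1 * z ** 3    # stand-in "proxy", unbounded
X_p = U_p - V_p           # the "error term", defined by subtraction

sel = U_p > np.quantile(U_p, 0.999)
print("corr(X', V'):", np.corrcoef(X_p, V_p)[0, 1])
print("E[X' | U' large]:", X_p[sel].mean(),
      " E[V' | U' large]:", V_p[sel].mean())
```

Because V′ saturates while U′ keeps growing, maximizing U′ makes the subtraction-defined X′ large rather than V′, and the measured correlation shows X′ is not independent of V′.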
I think independence is probably the biggest weakness of the post just because it’s an extremely strong assumption, but I have reasons why the U = V+X framing is natural here. The error term X has a natural meaning in the case where some additive terms of V are not captured in U (e.g. because they only exist off the training distribution), or some additive terms of U are not in V (e.g. because they’re ways to trick the overseer).
The example of two arbitrary functions doesn’t seem very central, because it seems to me that if we train U to approximate V, its in-distribution correlation with V will be due to the presence of shared features in the data, rather than being coincidental. Maybe the features won’t be additive or independent, though, and we should think about those cases. It still seems possible to prove things if you weaken independence to unbiasedness.
Agree that we currently only analyze regressional and perhaps extremal Goodhart; people should be thinking about the other two as well.