I think independence is probably the biggest weakness of the post, just because it's such a strong assumption, but there are reasons the U = V + X framing is natural here. The error term X has a natural meaning when some additive terms of V are not captured in U (e.g. because they only show up off the training distribution), or when some additive terms of U are not in V (e.g. because they're ways to trick the overseer).
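To make that concrete (a sketch in my own notation, not something from the post): suppose the true value decomposes into additive features, and the proxy captures some of them, misses others, and picks up some spurious ones:

$$V = \sum_{i \in S} f_i + \sum_{i \in M} f_i, \qquad U = \sum_{i \in S} f_i + \sum_{j \in T} g_j,$$

where $S$ indexes the shared features, $M$ the terms of $V$ that $U$ misses, and $T$ the spurious terms of $U$. Then $U = V + X$ with $X = \sum_{j \in T} g_j - \sum_{i \in M} f_i$: the error is exactly the overseer-trickable terms minus the uncaptured ones, and independence of $X$ from $V$ becomes a substantive assumption about how these groups of features covary.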
The example of two arbitrary functions doesn't seem very central to me: if we train U to approximate V, their correlation on the training distribution will come from shared features in the data rather than from coincidence. Maybe the features won't be additive or independent, though, and we should think about those cases. It still seems possible to prove things if you weaken independence to unbiasedness.
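To spell out the weakening I have in mind (my formalization, not a claim from the post): independence is the full distributional condition

$$X \perp V, \quad \text{i.e.} \quad P(X \mid V) = P(X),$$

while unbiasedness only asks

$$\mathbb{E}[X \mid V] = 0, \quad \text{equivalently} \quad \mathbb{E}[U \mid V = v] = v \;\text{ for all } v,$$

which is strictly weaker: the proxy never systematically over- or under-shoots at any given true value, but the error's variance or shape may still depend on $V$.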
Agreed that we currently only analyze regressional and perhaps extremal Goodhart; people should be thinking about the other two (causal and adversarial) as well.