I think you’re (at least partly) right in spirit. The key extra nuance is that by constraining the ‘angle’ between the reward functions[1], you can rule out very opposed utilities like the one which rewards falling in a pit. So this is not true:

> to avoid Goodharting it has to consider every possible reward function that is improved by the first few bits of optimization pressure on the proxy objective.
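To gesture at why (this is my own paraphrase, not the paper’s exact setup; the notation $\mu^{\pi}$, $\tilde{r}$ and the particular normalisation are mine): if you treat rewards as vectors over state-action pairs, return is an inner product with the occupancy measure, and a lower bound on the cosine of the angle between the proxy and the admissible true rewards excludes near-opposed utilities like the pit one.

```latex
% Sketch only, not the paper's exact definitions (it may normalise rewards and
% project out constant shifts before measuring the angle).
\[
  J_r(\pi) = \langle \mu^{\pi}, r \rangle, \qquad
  \cos\theta(r, \tilde{r}) = \frac{\langle r, \tilde{r} \rangle}{\lVert r \rVert\,\lVert \tilde{r} \rVert}.
\]
% Illustrative assumption: a utility that rewards falling into the pit points roughly
% opposite to the proxy \tilde{r} in reward space, so \cos\theta \approx -1.
% Any constraint \cos\theta(r, \tilde{r}) \ge c with c > -1 then excludes it from the
% set of admissible true rewards, so you need not guard against every reward function
% that happens to be improved by the first few bits of optimisation on \tilde{r}.
```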
In particular, I think you’re imagining gradients in policy space (indeed a practical consideration). But this paper is considering gradients in occupancy space (which in effect bakes in some assumptions about foresight, etc.).
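To spell out that distinction as I understand it (a sketch under my own notation, not the paper’s): in policy space the objective is a complicated function of the parameters, whereas over occupancy measures it is linear, so a gradient step there already “sees” the long-run visitation consequences.

```latex
% Sketch of the two optimisation geometries; \mathcal{M} is the convex polytope of
% valid (discounted) state-action occupancy measures, \tilde{r} the proxy reward.
\[
  \text{policy space:}\qquad
  \nabla_{\phi}\, J(\pi_{\phi}) \;=\; \nabla_{\phi}\, \langle \mu^{\pi_{\phi}}, \tilde{r} \rangle
  \quad \text{(generally non-concave in } \phi\text{)},
\]
\[
  \text{occupancy space:}\qquad
  \max_{\mu \in \mathcal{M}} \; \langle \mu, \tilde{r} \rangle,
  \qquad \nabla_{\mu}\, \langle \mu, \tilde{r} \rangle = \tilde{r}
  \quad \text{(linear in } \mu\text{)}.
\]
% A step in \mu moves the long-run state-action distribution directly,
% which is where the implicit foresight assumption comes in.
```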
> How?

Yes, this is a pretty big question (there are some theoretical and empirical ideas, but I don’t rate any of them yet, personally).