I have relatively little idea how to “improve” a reward function so that it improves the inner cognition chiseled into the policy, because I don’t know the mapping from outer reward schedules to inner cognition within the agent.
You don’t need to know the full mapping to suspect that, when we reward agents for doing undesirable things, we tend to get more undesirable cognition. For example, if we reward agents for lying to us, then we’ll tend to get less honest agents. We can construct examples where this isn’t true, but it seems like a pretty reasonable working hypothesis. It’s possible that discarding this working hypothesis will lead to better research, but I don’t think your arguments establish that; they only establish that we might, in theory, find ourselves in a situation where it’s reasonable to discard it.
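To make the working hypothesis concrete, here is a toy sketch (my own illustration, not something from this exchange): a two-action softmax bandit trained with vanilla policy gradient, where the reward function mistakenly pays out for the "deceptive" action. The action labels and numbers are purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)           # action 0 = "honest", action 1 = "deceptive" (illustrative labels)
reward = np.array([0.0, 1.0])  # misspecified reward: deception pays, honesty doesn't
lr = 0.1

for _ in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()
    action = rng.choice(2, p=probs)
    # REINFORCE: gradient of log pi(action) w.r.t. the logits is onehot(action) - probs
    grad_log_pi = np.eye(2)[action] - probs
    logits += lr * reward[action] * grad_log_pi

probs = np.exp(logits) / np.exp(logits).sum()
print(f"P(deceptive) after training: {probs[1]:.2f}")  # climbs toward 1
```

Under these (deliberately crude) assumptions, the probability of the rewarded action climbs toward 1; that is the rough sense in which rewarding lying tends to produce a less honest policy, without needing the full reward-to-cognition mapping.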
This specific point is why I said “relatively” little idea, and not zero idea. You have defended the common-sense version of “improving” a reward function (which I agree with: don’t reward obviously bad things), but I perceive you to have originally made a much stronger and more speculative claim, something like “‘amplified’ reward signals are improvements over non-‘amplified’ reward signals” (which might well be true, but how would we know?).
Amplification can just be used as a method for making more and better common-sense improvements, though. You could also do all sorts of other stuff with it, but standard examples (like “catch agents when they lie to us”) seem very much like common-sense improvements.