The gradient pressure towards valuing reward terminally when you’ve already figured out reliable strategies for doing what humans want seems very weak… in practice, it seems to me like these differences would basically only happen due to operator error, or cosmic rays, or other genuinely very rare events (as you describe in the “Security Holes” section).
Yeah, I disagree. With plain HFDT, it seems like there’s continuous pressure to improve things on the margin by being manipulative—telling human evaluators what they want to hear, playing to pervasive political and emotional and cognitive biases, minimizing and covering up evidence of slight suboptimalities to make performance on the task look better, etc. I think that in basically every complex training episode a model could do a little better by explicitly thinking about the reward and being a little-less-than-fully-forthright.
Ok, this is pretty convincing. The only outstanding question I have is whether each of these suboptimalities is easier for the agent to integrate into its reward model directly, or whether it’s easiest to incorporate them by deriving them via reasoning about the reward. It seems certain that eventually some suboptimalities will fall into the latter camp.
When that does happen, an interesting question is whether SGD will favor converting the entire objective to direct-reward-maximization, or whether direct-reward-maximization will be just one component of the objective alongside other heuristics. One reason to suppose the latter is findings like those in this paper (https://arxiv.org/pdf/2006.07710.pdf), which suggests that, once a model has achieved good accuracy, it won’t update to a better hypothesis even if that hypothesis is simpler. The more recent grokking literature might push us the other way, though if I understand correctly grokking mostly occurs when weight decay is added (creating a “stronger” simplicity prior).
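(To make that last point concrete, here’s a minimal sketch, not taken from the grokking papers themselves, of what “adding weight decay” does mechanically; the PyTorch setup, layer sizes, and hyperparameters are placeholders I’m assuming for illustration. The idea is just that each optimizer step also shrinks every weight toward zero, so among hypotheses that fit the training data equally well, training drifts toward the lower-norm, “simpler” one, which is the sense in which it acts as a stronger simplicity prior.)

```python
# Minimal sketch (assumed PyTorch setup, placeholder sizes/hyperparameters) of
# "adding weight decay": besides following the gradient, each update shrinks
# every weight toward zero.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(64, 10)  # stand-in for a small network

# AdamW uses decoupled weight decay: roughly w <- w - lr*wd*w on top of the Adam step.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

x = torch.randn(32, 64)          # dummy batch
y = torch.randint(0, 10, (32,))  # dummy labels
loss = F.cross_entropy(model(x), y)
loss.backward()
opt.step()        # gradient step plus shrinkage of all weights toward zero
opt.zero_grad()
```

Setting weight_decay=0 removes that shrinkage term, which, if the point above is right, is the regime where grokking reportedly mostly fails to appear.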