Ok, this is pretty convincing. The only outlying question I have is whether each of these suboptimalities are easier for the agent to integrate into its reward model directly, or whether it’s easiest to incorporate them by deriving them via reasoning about the reward. It seems certain that eventually some suboptimalities will fall into the latter camp.
When that does happen, an interesting question is whether SGD will favor converting the entire objective to direct-reward-maximization, or whether direct-reward-maximization will be just a component of the objective along with other heuristics. One reason to suppose the latter might occur are findings like this paper (https://arxiv.org/pdf/2006.07710.pdf) which suggest that, once a model has achieved good accuracy, it won’t update to a better hypothesis even if that hypothesis is simpler. The more recent grokking literature might push us the other way, though if I understand correctly it mostly occurs when adding weight decay (creating a “stronger” simplicity prior).
Ok, this is pretty convincing. The only outlying question I have is whether each of these suboptimalities are easier for the agent to integrate into its reward model directly, or whether it’s easiest to incorporate them by deriving them via reasoning about the reward. It seems certain that eventually some suboptimalities will fall into the latter camp.
When that does happen, an interesting question is whether SGD will favor converting the entire objective to direct-reward-maximization, or whether direct-reward-maximization will be just a component of the objective along with other heuristics. One reason to suppose the latter might occur are findings like this paper (https://arxiv.org/pdf/2006.07710.pdf) which suggest that, once a model has achieved good accuracy, it won’t update to a better hypothesis even if that hypothesis is simpler. The more recent grokking literature might push us the other way, though if I understand correctly it mostly occurs when adding weight decay (creating a “stronger” simplicity prior).