In the first model I thought through, though, I don’t think that you’re right: if you train a model with RL with a KL penalty, it will end up with a policy that outputs a distribution over answers which is equivalent to taking the generative distribution and then applying a Boltzmann factor to upweight answers that your overseer likes. AFAICT this doesn’t generally induce more causal Goodhart problems than best-of-N selection does.
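For reference, the idealised version of this claim (assuming the KL-regularised objective is optimised exactly against the true reward $r$, with KL coefficient $\beta$ and generative distribution $\pi_0$) is the standard tilted-distribution result:

$$\pi^* = \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot\mid x)}\!\left[r(x,y)\right] - \beta\, D_{\mathrm{KL}}\!\big(\pi(\cdot\mid x)\,\|\,\pi_0(\cdot\mid x)\big), \qquad \pi^*(y\mid x) \propto \pi_0(y\mid x)\, e^{r(x,y)/\beta}.$$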
As far as I can tell, the argument for this assumes that the model generalises the reward function perfectly, which seems questionable given that it only ever sees a few samples of the reward function during training (call this claim 1).
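To gesture at what I mean by claim 1, here's a toy sketch (entirely hypothetical numbers, not a model of real RLHF training): two reward hypotheses that fit the few training samples equally well can tilt the generative distribution in quite different directions.

```python
import numpy as np

# Toy illustration (hypothetical numbers): two reward hypotheses that agree
# on the answers seen during training (indices 0 and 1) but generalise
# differently to the unseen answers give noticeably different tilted policies.
pi0 = np.array([0.4, 0.3, 0.15, 0.1, 0.05])    # generative distribution over 5 answers
r_hyp_a = np.array([1.0, 0.0, 1.0, 1.0, 0.0])  # one way of generalising the reward
r_hyp_b = np.array([1.0, 0.0, 0.0, 0.0, 1.0])  # another way, identical on seen answers
beta = 0.5                                     # KL penalty coefficient

def boltzmann_tilt(pi0, r, beta):
    """Tilt the generative distribution by exp(r / beta) and renormalise."""
    w = pi0 * np.exp(r / beta)
    return w / w.sum()

print(boltzmann_tilt(pi0, r_hyp_a, beta))  # mass moves onto answers 0, 2, 3
print(boltzmann_tilt(pi0, r_hyp_b, beta))  # mass moves onto answers 0 and 4
```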
One possibility is that the Boltzmann factor upweights answers the model is confident the overseer will like, relative to answers it is less confident about. This could end up applying pressure to concentrate probability mass on high-confidence approved answers, which could break desirable correlations (call this claim 2).
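A toy way to make claim 2 concrete (again purely hypothetical, and assuming for illustration that the effective reward being tilted on is the model's approval estimate discounted by its own uncertainty):

```python
import numpy as np

# Hypothetical toy model of claim 2: each answer has an estimated approval
# (mean) and an uncertainty (std). Assume, purely for illustration, that the
# effective reward being tilted on is the uncertainty-discounted estimate.
pi0           = np.array([0.25, 0.25, 0.25, 0.25])
mean_approval = np.array([0.8, 0.8, 0.8, 0.2])    # three equally-approved answers...
uncertainty   = np.array([0.05, 0.3, 0.3, 0.05])  # ...but only one is high-confidence
beta, lam = 0.2, 1.0

effective_r = mean_approval - lam * uncertainty
tilted = pi0 * np.exp(effective_r / beta)
print(tilted / tilted.sum())
# ~[0.62, 0.18, 0.18, 0.03]: probability mass concentrates on the single
# high-confidence approved answer, breaking the symmetry between answers
# that the mean estimate says are equally good.
```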
I’m fairly confident that, in general, a Bayesian learner will not generalise the reward function perfectly (claim 1): how the reward function generalises will typically depend on the posterior over parameters, and different posteriors can yield the same distribution over texts (Sam’s sibling comment illustrates this point). I’ve no idea what’s going on with transformers, though: they generalise text prediction quite well, so maybe they generalise approval quite well too, but that’s just a wild guess. Claim 2 is idle speculation; I only mean it to illustrate the point about pressure to break desirable correlations.