Thanks! Causal Goodhart is a good point, and I buy now that RLHF seems even worse from a Goodhart perspective than filtering. Just unsure by how much, and how bad filtering itself is. In particular:
In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful
This is the part I’m still not sure about. For example, maybe the simplest/apparently-easiest-to-understand answer that looks good to humans tends to be false. Then if human raters prefer simpler answers (because they’re more confident in their evaluations of those), the maximally approved answers might be bad. This is similar to the truths humans can’t be convinced of that you mention, but with the difference that it’s just a matter of how convinced humans are by different answers. We could then be in a situation where both filtering and RLHF suffer a lot from Goodhart’s law, and while RLHF might technically be even worse, the difference wouldn’t matter in practice since we’d need a solution to the fundamental problem anyway.
I feel like the key question here is how much selection pressure we apply. My sense is that for sufficient amounts of selection pressure, we do quite plausibly run into extremal Goodhart problems like that. But it also seems plausible we wouldn’t need to select that hard (e.g. we don’t need the single most compelling answer), in which case I agree with what you said.
As a caveat, I didn’t think of the RL + KL = Bayesian inference result when writing this; I’m much less sure now (and more confused).
Anyway, what I meant: think of the computational graph of the model as a causal graph, then changing the weights via RLHF is an intervention on this graph. It seems plausible there are somewhat separate computational mechanisms for producing truth and for producing high ratings inside the model, and RLHF could then reinforce the high rating mechanism without correspondingly reinforcing the truth mechanism, breaking the correlation. I certainly don’t think there will literally be cleanly separable circuits for truth and high rating, but I think the general idea plausibly applies. I don’t see how anything comparable happens with filtering.
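For reference, here is one common way to state the "RL + KL = Bayesian inference" result mentioned in the caveat. The symbols are my own choice, not from the thread: $\pi_0$ is the base policy, $r$ the reward, and $\beta > 0$ the KL coefficient. This is a sketch of the standard statement for a discrete output space, not a claim about any particular RLHF setup.

```latex
\begin{aligned}
J(\pi) &= \mathbb{E}_{x \sim \pi}\big[r(x)\big] - \beta\, \mathrm{KL}\big(\pi \,\|\, \pi_0\big), \\
\pi^{*}(x) &= \arg\max_{\pi} J(\pi) = \frac{1}{Z}\, \pi_0(x)\, \exp\!\big(r(x)/\beta\big),
\qquad Z = \sum_{x} \pi_0(x)\, \exp\!\big(r(x)/\beta\big).
\end{aligned}
```

Read this way, $\pi_0$ plays the role of a prior and $\exp(r(x)/\beta)$ the role of a likelihood, so the optimum is a reweighting of the base distribution: $\pi^{*}(x) > 0$ only where $\pi_0(x) > 0$. Whether actual RLHF training reaches this optimum is a separate question.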
Without some form of regularization, some forms of RL can lead to trajectories that have zero probability wrt the base distribution (e.g. because they break a correlation that occurs on the pretraining distribution with 100% accuracy). However, sampling cannot lead to trajectories with zero probability?
As stated, this claim is false for LMs without top-p sampling or floating point rounding errors, since every token has a logit greater than negative infinity and thus a probability greater than actual 0. So with enough sampling, you’ll find the RL trajectories.
This is obviously a super pedantic point: RL finds sentences with cross entropy of 30+ nats wrt the base distribution all the time, while you’ll never do Best-of-exp(30)~=1e13. And there’s an empirical question of how much performance you get versus how far your new policy is from the old one, e.g. if you look at Leo Gao’s recent RLHF paper, you’ll see that RL is more off distribution than BoN at equal proxy rewards.
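To make the arithmetic above concrete, here is a small sketch. It assumes "30 nats" means a trajectory whose log-probability under the base distribution is -30, which is my reading of the comment, not something it states exactly. The function names are made up for illustration.

```python
import math


def expected_draws(nats):
    """Expected number of base-distribution samples before hitting a
    trajectory whose base log-probability is -nats."""
    return math.exp(nats)


def hit_probability(nats, n):
    """Probability that at least one of n independent samples is that
    trajectory: 1 - (1 - p)^n with p = exp(-nats), computed stably."""
    p = math.exp(-nats)
    return -math.expm1(n * math.log1p(-p))


print(f"draws needed for 30 nats: {expected_draws(30):.3e}")
print(f"chance of a hit with best-of-1000: {hit_probability(30, 1000):.3e}")
```

The probability is strictly positive for any finite number of nats, which is the pedantic point, but at realistic values of n it is negligible.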
That being said, I do think you need to make more points than just “RL can result in incredibly implausible trajectories” in order to claim that BoN is safer than RL, since I claim that Best-of-exp(30) is not clearly safe either!
No, I’m not claiming that. What I am claiming is something more like: there are plausible ways in which applying 30 nats of optimization via RLHF leads to worse results than best-of-exp(30) sampling, because RLHF might find a different solution that scores that highly on reward.
Toy example: say we have two jointly Gaussian random variables X and Y that are positively correlated (but not perfectly). I could sample 1000 pairs and pick the one with the highest X-value. This would very likely also give me an unusually high Y-value (how high depends on the correlation). Or I could change the parameters of the distribution such that a single sample will typically have an X-value as high as the 99.9th percentile of the old distribution. In that case, the Y-value I typically get will depend a lot on how I changed the parameters. E.g. if I just shifted the X-component of the mean and nothing else, I won’t get higher Y-values at all.
I’m pretty unsure what kinds of parameter changes RLHF actually induces; I’m just saying that parameter updates can destroy correlations in a way that conditioning doesn’t. This is with the same amount of selection pressure on the proxy in both cases.
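The toy example above can be simulated directly. This is a minimal sketch under assumptions I picked for illustration: unit variances, a correlation of 0.8, and a hypothetical function name. It only illustrates the Gaussian example and is not a model of what RLHF does to a network.

```python
from statistics import NormalDist

import numpy as np


def compare_selection_methods(rho=0.8, n=1000, trials=2000, seed=0):
    """Mean (X, Y) under best-of-n conditioning vs. a mean shift in X only."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])

    # Conditioning: in each trial draw n pairs and keep the one with the highest X.
    pairs = rng.multivariate_normal([0.0, 0.0], cov, size=(trials, n))
    top = pairs[:, :, 0].argmax(axis=1)
    best = pairs[np.arange(trials), top]  # shape (trials, 2)

    # Parameter change: move only the X-component of the mean to the old
    # (1 - 1/n) quantile (99.9th percentile for n=1000); the covariance and
    # the Y-mean are left untouched.
    shift = NormalDist().inv_cdf(1.0 - 1.0 / n)
    shifted = rng.multivariate_normal([shift, 0.0], cov, size=trials)

    return {"best_of_n": best.mean(axis=0), "mean_shift": shifted.mean(axis=0)}


result = compare_selection_methods()
for name, (x_mean, y_mean) in result.items():
    print(f"{name}: mean X = {x_mean:.2f}, mean Y = {y_mean:.2f}")
```

For unit-variance jointly Gaussian variables, E[Y | X = x] = rho * x, so the best-of-n Y mean should come out at roughly rho times the selected X value, while the mean-shifted Y mean should stay near zero even though X is comparably high in both cases.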
RLHF could then reinforce the high rating mechanism without correspondingly reinforcing the truth mechanism, breaking the correlation.
I unconfidently think that in this case, RLHF will reinforce both mechanisms, but reinforce the high rating mechanism slightly more, which nets out to no clear difference from conditioning. But I wouldn’t be shocked to learn I was wrong.
Can you explain why RLHF is worse from a Causal Goodhart perspective?
I think your claim is something like:
Cool, I don’t think we disagree here.