It’s true that minimizing KL to the base model, subject to the constraint that reward always exceeds a certain threshold, would theoretically be equivalent to Bayesian conditioning and therefore equivalent to filtering.
It’s also true that maximizing Reward − KL is Bayesian updating, as the linked post shows, and that maximizing reward subject to a KL constraint is likewise equivalent to Bayesian updating (by Lagrange multipliers). You see similar results with Max Ent RL (where you maximize Reward + Entropy, and entropy equals a constant minus the KL relative to a uniform distribution), for example.
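To spell both of those out (my notation, not the post’s: π₀ is the base LM, r the reward, β the KL coefficient, τ the reward threshold):

```latex
% Sketch of the two equivalences above.
% \pi_0 = base LM, r = reward, \beta = KL coefficient, \tau = reward threshold.

% Hard constraint: minimizing KL to the prior subject to the reward always
% exceeding \tau gives the prior conditioned on that event, i.e. filtering.
\[
\operatorname*{arg\,min}_{\pi \,:\, \pi(r(x) \ge \tau) = 1} \mathrm{KL}(\pi \,\|\, \pi_0)
  \;=\; \pi_0(\,\cdot \mid r(x) \ge \tau\,).
\]

% Soft penalty: maximizing Reward - \beta * KL gives the prior tilted by a
% Boltzmann factor, i.e. a Bayesian update with likelihood e^{r(x)/\beta}.
\[
\operatorname*{arg\,max}_{\pi}\; \mathbb{E}_{x \sim \pi}[r(x)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_0)
  \;=\; \frac{\pi_0(x)\, e^{r(x)/\beta}}{Z},
\qquad Z = \sum_x \pi_0(x)\, e^{r(x)/\beta}.
\]

% Max Ent RL is the special case \pi_0 = \mathrm{Uniform}(\mathcal{X}):
% H(\pi) = \log\lvert\mathcal{X}\rvert - \mathrm{KL}(\pi \,\|\, \mathrm{Uniform}),
% so Reward + \beta H(\pi) equals Reward - \beta\,\mathrm{KL}(\pi \,\|\, \mathrm{Uniform})
% plus a constant.
```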
Unfortunately, this would be computationally difficult with gradient descent, since you would already have mode collapse before the KL penalty started to act.
Sounds like you need to increase the KL penalty, then!
I think this setup has a linear tradeoff between how much helpfulness you get and how much you avoid Causal Goodhart.
I don’t see why this argument doesn’t also apply to the conditioning case: if you condition on a proxy reward being sufficiently high, you run into exactly the same issues as with KL-regularized RL with a binarized reward.
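For concreteness, here’s a toy numerical example (all numbers made up) of how filtering on a proxy reward plays out as the threshold rises:

```python
# Toy example: conditioning on a proxy reward clearing a threshold concentrates
# probability on whatever games the proxy, exactly the pressure you'd get from
# optimizing a binarized reward. All numbers below are invented for illustration.
import numpy as np

base_probs   = np.array([0.40, 0.30, 0.20, 0.09, 0.01])  # toy base LM over 5 outputs
true_reward  = np.array([0.0,  1.0,  2.0,  3.0, -5.0])   # what we actually care about
proxy_reward = np.array([0.0,  1.0,  2.0,  3.0, 10.0])   # what the filter sees

for threshold in (2.0, 3.0, 4.0):
    # Conditioning / filtering: keep outputs whose proxy clears the threshold, renormalize.
    mask = proxy_reward >= threshold
    cond = base_probs * mask
    cond = cond / cond.sum()
    print(f"threshold {threshold}: expected true reward {np.dot(cond, true_reward):+.2f}")
```

With these numbers the expected true reward improves for moderate thresholds and then collapses once only the proxy-gaming output survives the filter.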
requires explicitly transforming the reward into a Boltzmann distribution
This seems like a misunderstanding of the post (and the result in general): it shows that doing RL with KL constraints is equivalent to a Bayesian update of the LM prior with an e^{r(x)/β} likelihood (and the update people use in practice is equivalent to variational inference). You wouldn’t do this update explicitly, because computing the normalizing factor Z is too hard (as usual); instead you just optimize Reward − KL as you usually would.
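As a sanity check of that claim, here’s a toy REINFORCE loop (my own toy setup, nothing from the post) that optimizes Reward − KL with the usual score-function estimator, never touches Z, and ends up near the tilted distribution π₀(x)e^{r(x)/β}/Z:

```python
# Toy sketch: optimize Reward - beta * KL with REINFORCE on a categorical policy.
# Z is never computed; the trained policy still approaches pi0(x) * exp(r(x)/beta) / Z.
# The setup (pi0, r, beta) is invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

pi0  = np.array([0.5, 0.3, 0.15, 0.05])  # toy "base LM" over 4 outputs
r    = np.array([0.0, 1.0, 2.0, 3.0])    # toy reward model
beta = 1.0
lr   = 0.02

logits   = np.log(pi0)                   # start the policy at the base model
baseline = 0.0                           # running-average baseline to reduce variance

for _ in range(20_000):
    pi = np.exp(logits - logits.max()); pi /= pi.sum()
    x = rng.choice(len(pi), p=pi)                               # sample an output
    shaped = r[x] - beta * (np.log(pi[x]) - np.log(pi0[x]))     # Reward - KL term
    advantage = shaped - baseline
    baseline += 0.01 * (shaped - baseline)
    grad_log_pi = -pi                                           # d log pi(x) / d logits ...
    grad_log_pi[x] += 1.0                                       # ... = one_hot(x) - pi
    logits += lr * advantage * grad_log_pi                      # REINFORCE ascent step

pi = np.exp(logits - logits.max()); pi /= pi.sum()
target = pi0 * np.exp(r / beta); target /= target.sum()
print("trained policy:", pi.round(3))
print("tilted target :", target.round(3))
```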
(Or use a decoding scheme to skip the training entirely; I’m pretty sure you can just do normal MCMC or approximate it with weighted decoding/PPLM.)
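For example, a crude sampling-importance-resampling stand-in for that idea (a toy categorical “model”, not actual MCMC or PPLM):

```python
# Minimal sketch: approximate sampling from pi0(x) * exp(r(x)/beta) / Z by
# sampling-importance-resampling from the base model. A crude stand-in for
# proper MCMC or weighted-decoding/PPLM-style approaches; pi0 and r are toys.
import numpy as np

rng = np.random.default_rng(0)

pi0  = np.array([0.5, 0.3, 0.15, 0.05])  # toy base model over 4 "outputs"
r    = np.array([0.0, 1.0, 2.0, 3.0])    # toy reward
beta = 1.0

# 1. Draw candidates from the base model.
candidates = rng.choice(len(pi0), size=20_000, p=pi0)

# 2. Weight each candidate by the Boltzmann likelihood exp(r(x)/beta);
#    Z never appears because the weights are normalized empirically.
weights = np.exp(r[candidates] / beta)
weights /= weights.sum()

# 3. Resample according to the weights to get approximate draws from the target.
resampled = rng.choice(candidates, size=20_000, p=weights)

empirical = np.bincount(resampled, minlength=len(pi0)) / len(resampled)
exact = pi0 * np.exp(r / beta)
exact /= exact.sum()
print("empirical:", empirical.round(3))
print("exact    :", exact.round(3))
```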