My vague understanding is that to correspond with Bayesian updating, RL has to have a quite restrictive KL penalty, and in practice people use much less, which might be like Bayesian updating on a pretend dataset where you've seen 50 copies of each RL example.
Is this accurate? Has anyone produced interesting examples of RL faithful to the RL-as-updating recipe, that you know of?
Yes, you are correct: RL with a KL penalty matches the corresponding Bayesian update only in the limit of full convergence; at any finite number of steps it is an approximation. Determining the speed of this convergence, especially for LLMs, remains an area for future work.
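For concreteness, here is a sketch of the standard variational argument behind that correspondence, in my own notation (reference policy π₀, reward r, KL coefficient β), which is not fixed by the exchange above:

```latex
% KL-regularized RL objective and its closed-form optimum.
% Notation (pi_0, r, beta) is an assumption of this sketch.
\[
  \pi^{*} \;=\; \arg\max_{\pi}\;
    \mathbb{E}_{x \sim \pi}\!\big[r(x)\big]
    \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi \,\|\, \pi_{0}\big)
  \qquad\Longrightarrow\qquad
  \pi^{*}(x) \;\propto\; \pi_{0}(x)\,\exp\!\big(r(x)/\beta\big).
\]
% Reading exp(r(x)/beta) as a likelihood, pi^* is the Bayesian posterior
% obtained by updating the prior pi_0 on that evidence.
% Weakening the penalty by a factor of k (beta -> beta/k) gives
\[
  \pi^{*}(x) \;\propto\; \pi_{0}(x)\,\Big[\exp\!\big(r(x)/\beta\big)\Big]^{k},
\]
% i.e. the same update applied k times, as if conditioning on k independent
% copies of the evidence (the "50 copies of each RL example" picture, k = 50).
```

Note that this equivalence holds only at the exact optimum of the regularized objective, which is why the finite-step caveat above matters.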