My vague understanding is that to correspond with Bayesian updating, RL has to have a quite restrictive KL penalty, and in practice people use much less, which might be like Bayesian updating on a pretend dataset where you've seen 50 copies of each RL example.
Is this accurate? Has anyone produced interesting examples of RL faithful to the RL-as-updating recipe, that you know of?
Yes, you are correct: RL with a KL penalty matches the corresponding Bayesian update only in the limit of full convergence; at any finite number of steps it is an approximation. Determining the speed of this convergence, especially for LLMs, remains an area for future work.
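For concreteness, here is a sketch of the standard variational argument behind that correspondence, in my own notation (reference policy π₀, reward r, KL coefficient β), which is not fixed by the exchange above:

```latex
% KL-regularized RL objective and its closed-form optimum.
% Notation (pi_0, r, beta) is an assumption of this sketch.
\[
  \pi^{*} \;=\; \arg\max_{\pi}\;
    \mathbb{E}_{x \sim \pi}\!\big[r(x)\big]
    \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi \,\|\, \pi_{0}\big)
  \qquad\Longrightarrow\qquad
  \pi^{*}(x) \;\propto\; \pi_{0}(x)\,\exp\!\big(r(x)/\beta\big).
\]
% Reading exp(r(x)/beta) as a likelihood, pi^* is the Bayesian posterior
% obtained by updating the prior pi_0 on that evidence.
% Weakening the penalty by a factor of k (beta -> beta/k) gives
\[
  \pi^{*}(x) \;\propto\; \pi_{0}(x)\,\Big[\exp\!\big(r(x)/\beta\big)\Big]^{k},
\]
% i.e. the same update applied k times, as if conditioning on k independent
% copies of the evidence (the "50 copies of each RL example" picture, k = 50).
```

Note that this equivalence holds only at the exact optimum of the regularized objective, which is why the finite-step caveat above matters.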