Great post! This seems like a useful perspective to keep in mind.
Somewhat orthogonally to the theoretical picture, I expect that in the current regime (only optimizing the policy a small amount), any method that does a reasonable job of maximizing reward while controlling how much the policy changes can be made to work in practice. For example, if PPO is tuned appropriately, the KL penalty term can be removed from the reward entirely—instead, PPO’s implicit “local” KL penalty controls the rate of policy change.
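As a concrete illustration of the two mechanisms being contrasted (a minimal sketch in PyTorch; the function and variable names are illustrative, not from any particular codebase): the first function folds an explicit KL penalty into the reward, while the second is PPO’s clipped surrogate loss, whose ratio clipping is the implicit “local” penalty that bounds how far a single update can move the policy.

```python
import torch

def kl_shaped_reward(reward, logp_policy, logp_ref, beta):
    # Explicit penalty: subtract beta * log(pi / pi_ref) from the reward,
    # so in expectation the objective is E[r(x)] - beta * KL(pi || pi_ref).
    return reward - beta * (logp_policy - logp_ref)

def ppo_clipped_loss(logp_new, logp_old, advantage, eps=0.2):
    # Implicit "local" penalty: the probability ratio is clipped to
    # [1 - eps, 1 + eps], so each update can only move the policy a
    # bounded amount away from the sampling policy, whatever the reward says.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```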
If we were in the regime of optimizing the policy significantly more, experience from traditional RL suggests that there would be an exploration-exploitation trade-off, which the RL perspective may again offer insight into.
> I expect that in the current regime (only optimizing the policy a small amount), any method that does a reasonable job of maximizing reward while controlling how much the policy changes can be made to work in practice.
Yes, that seems plausible. Though, as you said, most methods that only change the policy a bit (early stopping, clipping in PPO) do so via implicit KL penalties, and can still be seen as updating a prior.
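To make the “updating a prior” reading concrete, this is the standard variational identity (a sketch, assuming reward $r$, initial policy $\pi_0$ as the prior, and KL coefficient $\beta$):

$$\pi^*(x) \;=\; \operatorname*{arg\,max}_{\pi}\;\Big(\mathbb{E}_{x\sim\pi}[r(x)] - \beta\,\mathrm{KL}(\pi\,\|\,\pi_0)\Big) \;\propto\; \pi_0(x)\,\exp\!\big(r(x)/\beta\big),$$

so the optimum is the prior $\pi_0$ reweighted by the evidence term $\exp(r(x)/\beta)$: a Bayesian update, not an unconstrained maximizer.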
> there would be an exploration-exploitation trade-off, which the RL perspective may again offer insight into.
Exploration-exploitation issues could definitely make distribution collapse more severe, and traditional RL tricks could help with that. But I still believe distribution collapse does not reduce to insufficient exploration, and good exploration alone won’t solve it. In this specific instance, failing to find the optimal policy is not the problem; the optimal policy itself is the problem.
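To spell out that last point (a sketch, assuming a finite space of sequences and no KL term): the objective is

$$\operatorname*{arg\,max}_{\pi}\; \mathbb{E}_{x\sim\pi}[r(x)],$$

and any maximizer puts all of its probability mass on $\operatorname{arg\,max}_x r(x)$; generically that is a single sequence, i.e. a zero-entropy policy. Collapse is thus a property of the target itself, and a better search procedure only reaches it faster.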