I expect that in the current regime (only optimizing the policy a small amount), any method that does a reasonable job of maximizing reward while controlling how much the policy changes can be made to work in practice.
Yes, that seems plausible. Though as you said, most methods that only change the policy a bit (early stopping, clipping in PPO) do that via implicit KL penalties and can still be seen as updating a prior.
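For concreteness, here is a sketch of the explicit version of this point, assuming a reward function $r(x)$ over sequences, the pretrained model $\pi_0$ as prior, and a KL coefficient $\beta$:

$$ J(\pi) = \mathbb{E}_{x \sim \pi}[r(x)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_0), \qquad \pi^*(x) = \frac{1}{Z}\,\pi_0(x)\exp\!\big(r(x)/\beta\big). $$

The maximizer $\pi^*$ has exactly the form of a Bayesian posterior: the prior $\pi_0$ reweighted by a "likelihood" $\exp(r(x)/\beta)$ and renormalized by $Z$. Methods that merely limit how far the policy can move (early stopping, PPO clipping) can be read as approximately targeting this same kind of softened update rather than the unregularized reward maximum.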
There would be an exploration-exploitation trade-off, which is something the RL perspective may again offer insight into.
Definitely, exploration-exploitation issues could make the distribution collapse more severe, and traditional RL tricks could help with that. But I still believe distribution collapse does not reduce to insufficient exploration, and good exploration alone won't solve it. In this specific instance, failing to find the optimal policy is not the problem; the optimal policy itself is the problem.
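To spell that last point out in the same notation as above, and assuming for simplicity a unique reward-maximizing sequence $x^*$: without the KL term, the objective is maximized by a point mass,

$$ \arg\max_{\pi}\;\mathbb{E}_{x \sim \pi}[r(x)] = \delta_{x^*}, \qquad x^* = \arg\max_x r(x). $$

Better exploration would only help us find $\delta_{x^*}$ faster; the degenerate target itself (zero entropy, zero probability on everything except $x^*$) is what constitutes distribution collapse. By contrast, the KL-regularized optimum $\pi^* \propto \pi_0 \exp(r/\beta)$ keeps support wherever $\pi_0$ does.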