Charlie Steiner comments on RL with KL penalties is better seen as Bayesian inference

Charlie Steiner 25 May 2022 22:43 UTC
LW: 3 AF: 2
You mention that converging to a deterministic policy is bad because of repetition, but did I miss you addressing that it's also bad because we want diversity? (Edit: now that I reread that sentence, it makes no sense. Sorry!) In some sense we don't want RL in the limit; we want something a little more aware that the goal is to sample from a distribution and get lots of different continuations that are all pretty good.
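
To make the contrast concrete, here is a minimal toy sketch, not taken from the post. It assumes the standard closed form for the optimum of KL-regularized reward maximization, pi(x) proportional to pi_0(x) * exp(r(x) / beta), and the four "continuations", their prior probabilities, their rewards, and the function names are all invented for illustration. As beta shrinks toward the pure-RL limit, the policy collapses onto the single highest-reward continuation; at larger beta it keeps several pretty-good ones in play.

```python
import math

def kl_regularized_policy(prior, rewards, beta):
    """Return pi(x) proportional to prior(x) * exp(reward(x) / beta)."""
    logits = [math.log(p) + r / beta for p, r in zip(prior, rewards)]
    top = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats; a rough proxy for diversity of samples."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical toy setup: four continuations, three of them "pretty good".
prior = [0.4, 0.3, 0.2, 0.1]
rewards = [1.0, 0.9, 0.8, 0.0]

for beta in (1.0, 0.1, 0.01):
    policy = kl_regularized_policy(prior, rewards, beta)
    print(f"beta={beta}: policy={[round(p, 3) for p in policy]}, "
          f"entropy={entropy(policy):.3f}")
```

In this toy setup the KL coefficient beta is the knob that trades off reward against diversity: the entropy of the policy drops as beta decreases.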