Lucius Bushnaq comments on RL with KL penalties is better seen as Bayesian inference