even if r is a smooth, real-valued function that perfectly captures human preferences across the whole space of possible sequences X, and if x∗ is truly the best thing, we still wouldn't want the LM to generate only x∗
Is this fundamentally true? I understand why it holds in practice: a model can only capture limited information with finite parameters and compute, so exactly modelling the single optimal output is too hard, and you need to keep some entropy/uncertainty in the model, which means you should aim to capture an accurate probability distribution over answers. But if we were able to perfectly predict the optimal output at all times, surely that would be good?
As an analogy, if we are trying to model the weather, a notoriously hard and chaotic system, with a limited number of parameters, we should aim to output a probability distribution over weather conditions.
However, if we want to predict the shortest path through a maze, getting the exactly correct shortest path is better than spreading probability over the top n shortest paths.
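To make the trade-off concrete, here is a minimal sketch (a toy example of my own, not from the quoted text) of the standard entropy-regularized view: if we maximize expected reward plus an entropy bonus weighted by beta, the optimal distribution over a finite set of candidates is a softmax over their rewards, and as beta shrinks it collapses onto the single best output x∗, which is exactly the "generate only x∗" behaviour the quote argues against. The candidate names, reward values, and beta values below are made up purely for illustration.

```python
import numpy as np

# Hypothetical candidate outputs and reward values; x* has the highest reward.
candidates = ["x*", "x1", "x2", "x3"]
rewards = np.array([3.0, 2.5, 1.0, 0.5])

def entropy_regularized_policy(r, beta):
    """Softmax over rewards: the maximizer of E_pi[r] + beta * H(pi).

    As beta -> 0 this collapses onto the argmax (generate only x*);
    larger beta keeps probability spread across other candidates.
    """
    z = r / beta
    z = z - z.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

for beta in [2.0, 0.5, 0.01]:
    probs = entropy_regularized_policy(rewards, beta)
    print(f"beta={beta}: " + ", ".join(f"{c}={p:.3f}" for c, p in zip(candidates, probs)))
```

In this toy setup, whether "generate only x∗" is desirable comes down to how much weight you put on the entropy term, which is the question being asked above.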