Do you think these insights would generalise to the case where the language model may be interacting with some system during this fine-tuning phase? For example, if it generates queries to an external search engine or API, or has dialogue with a human, then the optimal policy is no longer equivalent to just generating the correct output distribution, as it now also involves environment observations. This setting makes the distinction between a generative model and a policy clearer, and maybe changes the relevance of this statement:
The problem with the RL objective is that it treats the LM as a policy, not as a generative model. While a generative model is supposed to capture a diverse distribution of samples, a policy is supposed to choose the optimal action.
That is, in these settings we do want an optimal policy and not a generative model. This is quite similar to what Paul was saying as well.
Further, if we see aligning language models as a proxy for aligning future strong systems, it seems likely that these systems will be taking multiple steps of interaction in some environment, rather than just generating one (or several) sequences without feedback.
That’s a good point and helps to make the distinction between generative models and policies clearer. In the interactive case, your policy pi(a|s) is a conditional distribution. You can equivalently view it as a collection of unconditional distributions {pi_s(a)}, one for each s, and for each of these you are likely to also get distribution collapse (onto the single best action for that state). Arguably, that’s what you want in RL.
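To make that collapse concrete, here is a minimal numerical sketch (the states, actions and rewards are made-up values, not from the post): maximising reward independently for each state s turns each pi_s(a) into a point mass on that state’s best action.

```python
import numpy as np

# Toy illustration: for each state s, the reward-maximising policy puts all
# probability on the single best action, i.e. each pi_s(a) collapses.
# Rewards are invented purely for illustration.
rewards = {
    "s1": np.array([1.0, 0.2, 0.1]),  # r(s1, a) for actions a0, a1, a2
    "s2": np.array([0.3, 0.9, 0.8]),  # r(s2, a)
}

for s, r in rewards.items():
    pi_s = np.zeros_like(r)
    pi_s[np.argmax(r)] = 1.0  # one-hot (collapsed) distribution pi_s(a)
    print(s, pi_s)            # s1 [1. 0. 0.], s2 [0. 1. 0.]
```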
So I think it mostly comes down to a philosophical difference. Do you want your LM to be a decision-maker acting in a world, or a model of some probability distribution over texts? If you want a decision-maker, and training on language is just scaffolding to get you there, then maybe staying close to the original distribution indeed only has instrumental value?
But what if what you want is just an oracle-type conversational AI: a knowledge base and a common-sense reasoner? Maybe in that case staying close to the human knowledge and inference rules represented in language is of inherent value?
I feel like the post/paper didn’t really make clear that this was the motivation. You state that distribution collapse is bad without really justifying it (on my reading). Clearly distributional collapse onto a degenerate, bad output is bad, and it is also bad from an optimisation perspective, since it will often stall learning by making exploration much less likely. But that seems different from distributional collapse onto the optimal output. For example, when you say
if x∗ is truly the best thing, we still wouldn’t want the LM to generate only x∗
I think I just disagree, at least in the case where we’re treating LLMs as a model for future agent-like systems that we will have to align, which to me is the reason they’re useful for alignment research. If there’s a normative claim that diversity is important, then that should just be part of your objective/algorithm.
I think the KL-divergence term is included in RLHF as an optimisation hack to make sure it works well. Maybe that has revealed that for alignment we actually wanted the Bayesian posterior distribution you describe (a softmax of the reward over trajectories) rather than just the optimal distribution according to the reward function (a hardmax). But it seems to be an empirical question whether that was our preference all along or whether it’s just useful in the current regime.
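For concreteness, here is a small numerical sketch of that softmax/hardmax contrast (the rewards, prior and coefficient beta are made-up values): the Bayesian posterior referred to above has the form p0(x) * exp(r(x) / beta) / Z, i.e. the original LM distribution tilted by a softmax of the reward, whereas pure reward maximisation puts all mass on the single highest-reward output.

```python
import numpy as np

# Toy sketch: hardmax (pure reward maximisation) vs. the KL-penalised
# optimum pi*(x) proportional to p0(x) * exp(r(x) / beta).
# Rewards, prior and beta are invented for illustration only.
r = np.array([1.0, 0.9, 0.2])   # reward of three candidate outputs
p0 = np.array([0.5, 0.3, 0.2])  # pretrained LM's prior over them
beta = 0.5                      # KL penalty coefficient

hardmax = np.zeros_like(r)
hardmax[np.argmax(r)] = 1.0     # -> [1, 0, 0]: distribution collapse

softmax = p0 * np.exp(r / beta)
softmax /= softmax.sum()        # -> keeps mass on near-optimal outputs

print(hardmax, softmax)
```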