Fascinating evidence!
I suspect this may be because RLHF elicits a singular scale of “goodness” judgements from humans, instead of a plurality of “goodness-of-a-kind” judgements. One way to interpret language models is as *mixtures* of conversational agents: they first sample some conversational goal, then some policy over words, conditioned on that goal:
$$P(w_1, w_2, w_3, \ldots) = \int_{\text{goal}} P(\text{goal})\, P(w_1, w_2, w_3, \ldots \mid \text{goal})$$

$$P(w_1, w_2, w_3, \ldots \mid \text{prompt}) = \int_{\text{goal}} P(\text{goal} \mid \text{prompt})\, P(w_1, w_2, w_3, \ldots \mid \text{goal})$$
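As a minimal illustration of this mixture view (the goal set and conditional word distributions below are entirely made up, not a claim about how any real model is implemented), generating a completion amounts to drawing a goal once and then drawing every word conditioned on that goal:

```python
import random

# Toy mixture-of-conversational-agents model (all numbers invented).
# P(goal): prior over conversational goals the base model might be simulating.
goal_prior = {"inform": 0.4, "entertain": 0.3, "persuade": 0.2, "troll": 0.1}

# P(word | goal): toy per-goal word distributions.
word_dists = {
    "inform":    {"therefore": 0.5, "lol": 0.1, "buy": 0.1, "asdf": 0.3},
    "entertain": {"therefore": 0.1, "lol": 0.6, "buy": 0.1, "asdf": 0.2},
    "persuade":  {"therefore": 0.3, "lol": 0.1, "buy": 0.5, "asdf": 0.1},
    "troll":     {"therefore": 0.1, "lol": 0.2, "buy": 0.1, "asdf": 0.6},
}

def sample(dist):
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

def generate(n_words=5):
    # The marginal over word sequences integrates over goals:
    # sample a goal from the prior, then sample words conditioned on it.
    goal = sample(goal_prior)
    return goal, [sample(word_dists[goal]) for _ in range(n_words)]

if __name__ == "__main__":
    for _ in range(3):
        print(generate())
```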
On this interpretation, what RL from human feedback does is shift/concentrate the distribution over conversational goals into a smaller range: the range of goals consistent with human feedback so far. And if humans are asked to give only a singular “goodness” rating, the distribution will shift towards only goals that do well on those ratings—perhaps dramatically so! We lose goal diversity, which means less gibberish, but also less of the plurality of realistic human goals.
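One way to make that “concentration” concrete (a toy sketch under the assumption that feedback acts roughly like a likelihood over goals; this is not a description of any actual RLHF pipeline): reweight the goal prior by how well each goal scores on a single “goodness” scale, and the resulting distribution piles onto the few highest-scoring goals.

```python
import math

# Toy prior over goals and a single scalar "goodness" score per goal
# (both invented for illustration).
goal_prior = {"inform": 0.4, "entertain": 0.3, "persuade": 0.2, "troll": 0.1}
goodness   = {"inform": 2.0, "entertain": 1.0, "persuade": 0.5, "troll": -3.0}

def concentrate(prior, scores, strength):
    # Reweight the prior by exp(strength * score), then renormalize.
    # Larger `strength` stands in for more optimization pressure from feedback.
    unnorm = {g: p * math.exp(strength * scores[g]) for g, p in prior.items()}
    z = sum(unnorm.values())
    return {g: w / z for g, w in unnorm.items()}

for strength in (0.0, 1.0, 5.0):
    post = concentrate(goal_prior, goodness, strength)
    print(strength, {g: round(p, 3) for g, p in post.items()})
# As strength grows, nearly all mass lands on "inform": goal diversity collapses.
```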
If the above is true, one corollary is that we should expect to see less mode collapse if one finetunes a language model on ratings elicited using a diversity of instructions (e.g. is this completion interesting? helpful? accurate?), and perhaps uses some kind of imitation-learning-inspired objective to mimic that distribution, rather than PPO (which is designed to optimize a single reward function, not a distribution over reward functions).
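A crude way to see the distinction (entirely a toy, with made-up completions and ratings; it is not the training setup of any real RLHF run): collapsing the axes into one scalar points all optimization pressure at a single winner, whereas eliciting ratings under several different instructions yields a set of per-instruction targets that an imitation-style objective could try to match as a distribution.

```python
# Toy completions rated under three different instructions (numbers invented).
ratings = {
    #            interesting  helpful  accurate
    "pun":            (0.9,     0.2,     0.3),
    "explanation":    (0.5,     0.9,     0.8),
    "citation":       (0.2,     0.6,     0.95),
}

# (a) Single "goodness" scale: average the axes into one scalar and optimize it.
goodness = {c: sum(r) / len(r) for c, r in ratings.items()}
best = max(goodness, key=goodness.get)
print("single-scale optimum:", best)  # all optimization pressure points here

# (b) Plurality of "goodness-of-a-kind" judgements: each instruction picks its
# own winner, and an imitation-style objective would aim to match this mixture
# of targets rather than a single argmax.
axes = ("interesting", "helpful", "accurate")
per_axis_winners = {
    axis: max(ratings, key=lambda c: ratings[c][i]) for i, axis in enumerate(axes)
}
print("per-instruction winners:", per_axis_winners)
```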
I agree that the (unprompted) generative model is doing something kind of like: choose a random goal, then optimize it.
In some sense that does reflect the “plurality of realistic human goals.” But I don’t think it’s a good way to reflect that diversity. It seems like you want to either (i) be able to pick which goal you pursue, or (ii) optimize an aggregate of several goals.
Either way, I think that’s probably best reflected by a deterministic reward function, and you’d probably prefer to be mindful about what you are getting rather than randomly sampling from webtext. (Though as I mention in my other comment, I think there are other good reasons to want the pure generative model.)
This seems like a good way to think about some of the examples of mode collapse, but doesn’t obviously cover all the cases. For example, when asking the model to produce a random number, is it really the case that there’s a particular conversational goal which the RLHF’d model is optimizing, such that 97 is the best random number for that goal? In this case, Paul’s guess that RLHF’d models tend to push probability mass onto the base model’s most likely tokens seems more explanatory.
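A toy illustration of that alternative explanation (assuming, purely for illustration, a made-up base-model distribution over the numbers 0 to 100; none of these probabilities come from a real model): sharpening the base distribution, here by raising it to a power and renormalizing, drives almost all mass onto the base model's single most likely answer, with no particular conversational goal needed to explain the collapse.

```python
# Toy base-model distribution over "random" numbers 0..100: humans (and hence
# the base model) already overrepresent certain numbers like 7, 42, and 97.
base = {n: 1.0 for n in range(101)}
base.update({7: 3.0, 42: 4.0, 97: 5.0})  # invented counts, for illustration only
z = sum(base.values())
base = {n: c / z for n, c in base.items()}

def sharpen(dist, power):
    # Raise each probability to `power` and renormalize. This is a stand-in for
    # "push probability mass onto the most likely tokens", not actual RLHF.
    unnorm = {k: p ** power for k, p in dist.items()}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

for power in (1, 4, 16):
    d = sharpen(base, power)
    top = max(d, key=d.get)
    print(f"power={power:2d}  P({top}) = {d[top]:.3f}")
# As sharpening increases, P(97) approaches 1: the base model's mode wins
# without positing any particular conversational goal being optimized.
```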