For text-davinci-002 the goal is to have the model do what the user asked as well as it can, not to sample from possible worlds. For example, if the user asks “Is X true?” and the model’s probability is 80%, the intended behavior is for the model to say “Probably” 100% of the time, not to say “Yes” 80% of the time and “No” 20% of the time.
This is often (usually?) the desired behavior. For pre-trained LMs people usually turn the temperature down (or use nucleus sampling or beam search or whatever) in order to get more reasonable behavior, but that introduces pathologies and so you’d prefer not to have to do it.
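For concreteness, here is a minimal numpy sketch of the kind of temperature / nucleus-sampling post-processing people apply to a pre-trained LM’s logits; the function and its defaults are illustrative, not anyone’s production setup:

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_p=0.95, rng=None):
    """Sample a token id from raw logits with temperature scaling + nucleus (top-p) truncation."""
    rng = rng or np.random.default_rng()

    # Temperature < 1 sharpens the softmax; temperature -> 0 approaches greedy decoding.
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()

    # Nucleus sampling: keep the smallest set of highest-probability tokens
    # whose cumulative mass reaches top_p, then renormalize over that set.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]

    truncated = np.zeros_like(probs)
    truncated[keep] = probs[keep]
    truncated /= truncated.sum()
    return int(rng.choice(len(probs), p=truncated))
```

This is the blunt instrument the comment is pointing at: it sharpens the output distribution after the fact, rather than the model itself giving the single calibrated answer.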
There are a number of reasons this behavior can be undesirable though:
Sometimes you want entropy, e.g. if you’ll have a user pick their favorite from N completions or you’re doing majority voting with chain of thought.
This model is not competent enough to say “Probably” 100% of the time. Instead I expect it will just say “Yes” 100% of the time. Extracting confidence from logits is a plausible way to get around this difficulty (see the sketch after this list), but it only works for the generative model.
If what you fundamentally want is plausible completions of text from the pretraining distribution then you are definitely worse off. You can ask the instruct model to “complete the following text”, but it does much worse than the pure generative model.
Each model randomly does better on some stuff and worse on other stuff; even if one model is better on average, the other will be better for some particular tasks.
You may just want to scientifically study the pure generative model.
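On the “extracting confidence from logits” point above, here is a minimal sketch of what it could look like for a yes/no question, assuming you can read the logits of the candidate answer tokens at the answer position; the verbal thresholds are arbitrary:

```python
import math

def p_yes_from_logits(yes_logit: float, no_logit: float) -> float:
    """P('Yes') under a two-way softmax over the 'Yes' and 'No' answer-token logits."""
    return 1.0 / (1.0 + math.exp(no_logit - yes_logit))

def verbalize(p_yes: float) -> str:
    """Render the extracted probability as hedged language instead of sampling the answer."""
    if p_yes >= 0.95:
        return "Yes"
    if p_yes >= 0.6:
        return "Probably"
    if p_yes > 0.4:
        return "Unclear"
    if p_yes > 0.05:
        return "Probably not"
    return "No"
```

As noted, this relies on the base model’s next-token probabilities being roughly calibrated, which is why it only works for the generative model.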
Intuitively it seems like OpenAI should offer both models, and should gradually try to improve the instruct model so that there are fewer and fewer non-academic reasons to use the pure generative model.
A stark example of mode collapse that seems unlikely to have been directly incentivized by RLHF training: I asked RLHF models and base models to generate random numbers and found that RLHF models tend to be sharply biased toward certain “random” numbers.
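(As an aside, a minimal way to quantify that bias; `sample_completion` is a hypothetical helper wrapping whichever sampling API you use:)

```python
from collections import Counter
import math

def random_number_stats(sample_completion, prompt, n=1000):
    """Ask the model for a 'random' number n times; report top answers and empirical entropy.

    A sharply mode-collapsed model concentrates on a few favorite numbers,
    so its entropy falls far below the uniform baseline.
    """
    counts = Counter(sample_completion(prompt).strip() for _ in range(n))
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return counts.most_common(5), entropy

# e.g. prompt = "Pick a random integer between 1 and 100: "
# a uniform answer distribution over 100 outcomes would give log2(100) ≈ 6.64 bits
```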
The RLHF model represents labeler preferences as R(prompt, completion). One limitation of that approach is that R can’t depend on the distribution of model outputs. But real human preferences do depend on that distribution, since the model operates in a world containing other copies of itself—e.g. other copies operating in parallel for best-of-N.
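Spelling that out: the RL step optimizes (roughly) a per-sample objective of the form

$$\max_{\pi}\; \mathbb{E}_{x \sim D,\, y \sim \pi(\cdot \mid x)}\big[R(x, y)\big] \;-\; \beta\, \mathbb{E}_{x \sim D}\big[\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{0}(\cdot \mid x)\big)\big],$$

and for a fixed prompt the reward term is linear in π(·|x), so on its own it is maximized by a point mass on the single best completion. A preference like “your answers across parallel calls should be diverse” is a non-linear functional of π(·|x), and no choice of per-sample R(x, y) expresses it.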
This limitation makes it extremely difficult for the model to converge to a reasonable randomized strategy—it will still happen in the limit, but I think it may take orders of magnitude more labels. I think this is something OpenAI should be interested in fixing, though it would be reasonable to prioritize it (compared to other alignment issues) based on how often customers care. I don’t think the issue is necessarily conceptually complicated though I think the whole thing is technically gnarly (both on the training loss side and on the learning problem side).
Rather, the implication is that mode collapse itself generalizes out of distribution for some reason. This is intriguing: it seems to point at an algorithmic difference between self-supervised pretrained models and the same models after a comparatively small amount optimization from the RLHF training process which significantly changes out-of-distribution generalization.
I think that a large part of what happens during fine-tuning is the model learning to stick with t=0 answers wherever possible, since that’s the answer that is most likely to be correct or reasonable. I don’t think it necessarily represents an interesting algorithmic difference; you’d see the same observations about generalization if the main effect of fine-tuning was just scaling up all the logits a bunch. Obviously it does something slightly more sophisticated, since it avoids some (but not all) t=0 pathologies. But I suspect it will transfer to other prompts for the same kinds of reasons that “scale up all logits” would.
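(To see why “scale up all logits” would behave this way: multiplying the logits by α > 1 is exactly sampling at temperature 1/α,

$$\operatorname{softmax}(\alpha z)_i \;=\; \frac{e^{\alpha z_i}}{\sum_j e^{\alpha z_j}} \;=\; \operatorname{softmax}_{T=1/\alpha}(z)_i,$$

so as α grows the sampling distribution concentrates on the argmax, i.e. the t=0 answer, for every prompt at once.)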
In contrast to text-davinci-002, where dissimilar prompts tend to fall into basins of different attractors, the wedding parties attractor is global, affecting trajectories starting from any prompt, or at least a very wide distribution.
I think this is because text-davinci-002 is optimizing for how well the completion addresses the user’s request, and so different completions will get a high reward for different prompts.
The sentiment model is optimizing for the sentiment of the completion, using a weak predictor of sentiment that likely has much more confidence about weddings than other positive events, and so “wedding” is just the highest-sentiment completion no matter how the story starts.
I do agree; I think there are two product use cases with instruct that have distinct optimal levels of entropy:
1. The more explorative use cases you mentioned, and cases where users do want diversity, e.g. generating story ideas.
2. Having factual / accurate answers.
I’m not sure exactly how OpenAI set their “KL budgets” for davinci instruct. For WebGPT they “compared a couple of KL budgets using human evaluations”, and those evaluations were for how factual the answers were.
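My reading of a “KL budget” here (an assumption about the general setup, not a claim about OpenAI’s exact procedure) is a cap on how far the fine-tuned policy may move from the pre-trained model:

$$\max_{\pi}\; \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\big[R(x, y)\big] \quad \text{s.t.} \quad \mathbb{E}_{x}\big[\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{0}(\cdot \mid x)\big)\big] \le \epsilon,$$

where a larger budget ε buys more reward-seeking behavior at the cost of the base model’s diversity, and the human evaluations are used to pick ε.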
So in that scenario we’ll see a KL budget that optimizes for 2, since the users don’t care about the diversity of multiple generations; they just care about the factual quality of a single generation.
Now I’m interested to see what happens if we change the evaluations so that users are shown, say, 3 samples from each model, in a scenario where diversity is desirable (e.g. generating story ideas). In deciding on the KL budget, we would probably get a much lower number, and that would allow them to serve a model more suited to use case 1.