I broadly agree with this perspective, and think the modeling vs inference distinction is a valuable one to make.
That said, it seems to me that in practice you often should be trying to converge to a zero entropy policy. The optimal action is not random; we want the model to be picking the output that looks best in light of its beliefs. (Old Eliezer post.)
For some applications you are sampling from your model multiple times and ensembling. In this case randomness can help if you have no memory. But in the same setting, optimizing the correct reward function (“solve the problem given that none of your previous tries worked”) wouldn’t involve converging to zero entropy, because the quality of an output decreases naturally as it becomes more similar to your previous outputs. (Though I think this is a minority of language model sampling in practice.)
It seems like the ideal way forward is to more accurately capture what you actually care about, then optimize that—staying close to the original distribution feels like more of a hack to me. It seems like you view the original distribution of webtext as more principled or fundamental than I do, but I’m not sure what accounts for that difference. For example, if you want a model that produces 10 diverse samples when run 10 times in parallel, I don’t see why “stay close to webtext” would be a particularly good way to do that. It seems like incorporating diversity directly into the reward function is the way you’d like to go in the long run, it poses some challenges in the short run, and “match the webtext distribution” is a shortcut you might use before solving the fundamental problem.
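(To make the “diversity directly in the reward function” idea concrete, here is a minimal sketch; `base_reward`, the penalty weight, and the crude string-similarity measure are hypothetical stand-ins for a real reward model and a real semantic-similarity measure, not anything proposed in the discussion above.)

```python
from difflib import SequenceMatcher

def diversity_adjusted_rewards(samples, base_reward, penalty=0.5):
    """Score each sample by its base reward minus a penalty proportional to
    its maximum similarity to the other samples in the batch, so that near-
    duplicates are worth less than genuinely different attempts."""
    def similarity(a, b):
        # Crude string overlap as a stand-in for a real semantic similarity.
        return SequenceMatcher(None, a, b).ratio()

    adjusted = []
    for i, sample in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        max_sim = max((similarity(sample, other) for other in others), default=0.0)
        adjusted.append(base_reward(sample) - penalty * max_sim)
    return adjusted
```

The same shape would cover the sequential case above: replace “the other samples in the batch” with “your previous failed attempts.”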
I’m personally mostly interested in KL minimization in order to ensure you continue exploring and to ensure the policy changes slowly enough for the reward model to keep up (preventing Goodharting from destroying the approximation quality of the preference model, and perhaps more speculatively from destroying the approximation quality of a limited human overseer). But as Jacob Hilton points out, right now it seems like a wide range of options will work out OK for that.
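(For reference, the objective under discussion is the usual KL-penalized RL objective, in my notation, with $\pi_0$ the pretrained webtext policy, $r$ the learned reward, and $\beta$ the penalty coefficient:

$$ J(\pi) \;=\; \mathbb{E}_{x \sim \pi}\big[r(x)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi \,\big\|\, \pi_0\big) $$

A larger $\beta$ both keeps the policy closer to $\pi_0$ and slows how quickly it can drift away from the data the reward model was trained on.)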
I think I broadly disagree with this—using RLHF in a way that actually corresponds to some conditional of the original policy seems very important to me.
> That said, it seems to me that in practice you often should be trying to converge to a zero entropy policy. The optimal action is not random; we want the model to be picking the output that looks best in light of its beliefs.
I think you probably expect that we’ll eventually be able to give good feedback even for quite complex tasks like “actually say the truth”—but if you don’t expect that to ever succeed (as I think I don’t), then staying close to existing data seems like a pretty important tool to prevent yourself from Goodharting on bad feedback.
> It seems like the ideal way forward is to more accurately capture what you actually care about, then optimize that
What if what you actually care about is predicting the world well? In that case, staying close to existing data actually seems like a very principled thing to do. In particular, the KL penalty here lets you interpret the feedback as just extracting a particular conditioned distribution, which is precisely the thing I think you want to do with such a predictor.
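(To spell out that interpretation, in my notation rather than the commenter’s: the optimum of the KL-penalized objective above is the base distribution reweighted by the exponentiated reward,

$$ \pi^*(x) \;\propto\; \pi_0(x)\, \exp\!\big(r(x)/\beta\big), $$

and in the special case where $r(x)$ is $0$ when some event $E$ holds and $-\infty$ otherwise, this is literally the conditional $\pi_0(x \mid E)$.)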
Thanks for sharing your thoughts, I found these remarks extremely insightful!
> It seems like the ideal way forward is to more accurately capture what you actually care about, then optimize that—staying close to the original distribution feels like more of a hack to me. It seems like you view the original distribution of webtext as more principled or fundamental than I do, but I’m not sure what accounts for that difference.
A reply that comes to mind is that maybe being grounded in the human knowledge, reasoning rules, and values represented in web text has inherent value? Maybe web text is already approximately aligned with human preferences and you only want to tweak that distribution a bit to match true human preferences? Assume that’s the case. Then we can decompose LM alignment into (i) learning the web text distribution and (ii) learning how to warp the web text distribution. It seems that (ii) is easier than just learning aligned behaviour from scratch: your reward model doesn’t have to work well on arbitrary text, only on text from distributions similar to webtext.
Another way of phrasing that point: maybe the assumption that you can have a perfect reward model is unrealistic, and we can offload some of the complexity of learning a reward model to a prior given by web text? Or, more philosophically, if you’re a Bayesian, you shouldn’t trust your reward model blindly; you should still have some prior.
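(A minimal sketch of that Bayesian reading, with `base_logprob` and `reward` as hypothetical stand-ins for a base LM and a learned reward model: treat the base model’s log-probability as a log-prior and the reward, scaled by $1/\beta$, as a log-likelihood, and rank candidates by the unnormalised log-posterior.)

```python
def rerank_with_prior(candidates, base_logprob, reward, beta=1.0):
    """Rank candidate texts by log-prior (base LM) plus log-likelihood
    (reward / beta), i.e. by an unnormalised log-posterior. The reward
    model is only ever asked to compare candidates the base LM already
    considers plausible, rather than being trusted on arbitrary text."""
    def log_posterior(text):
        return base_logprob(text) + reward(text) / beta

    return sorted(candidates, key=log_posterior, reverse=True)
```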