Thanks for sharing your thoughts; I found these remarks extremely insightful!
It seems like the ideal way forward is to more accurately capture what you actually care about and then optimize that—staying close to the original distribution feels like more of a hack to me. It seems like you view the original distribution of web text as more principled or fundamental than I do, but I’m not sure what accounts for that difference.
A reply that comes to mind is that maybe being grounded in the human knowledge, reasoning rules and values represented in web text has inherent value? Maybe web text is already approximately aligned with human preferences and you only want to tweak that distribution a bit to match true human preferences? Assume that’s the case. Then we can decompose LM alignment into (i) learning the web text distribution and (ii) learning how to warp the web text distribution. It seems that (ii) is easier than just learning aligned behaviour from scratch: your reward model doesn’t have to work well on arbitrary text, only on text from distributions similar to web text.
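To make that decomposition a bit more concrete, here is a sketch of the standard way (ii) gets written down, using the usual KL-regularized objective (the notation is mine, not yours: \(p_{\text{web}}\) for the web text distribution, \(r\) for the reward model, \(\beta\) for a temperature controlling how far you move from the prior):

\[
\pi^{*}(x) \;\propto\; \underbrace{p_{\text{web}}(x)}_{\text{(i) pretraining}} \cdot \underbrace{\exp\!\big(r(x)/\beta\big)}_{\text{(ii) warping}},
\qquad
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{x\sim\pi}\big[r(x)\big] \;-\; \beta\,\mathrm{KL}\big(\pi \,\|\, p_{\text{web}}\big).
\]

Because \(\pi^{*}\) only puts mass where \(p_{\text{web}}\) does, \(r\) only ever gets evaluated on text the prior considers plausible—which is the sense in which (ii) looks easier than learning aligned behaviour from scratch.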
Another way of phrasing that point: maybe the assumption that you can have a perfect reward model is unrealistic, and we can offload some of the complexity of learning a reward model to a prior given by web text? Or, more philosophically, if you’re a Bayesian, you shouldn’t trust your reward model blindly; you should still have some prior.
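Spelling out that Bayesian reading (again a sketch with my own notation: treat “aligned” as a binary event \(A\) whose likelihood the reward model parameterizes):

\[
p(x \mid A) \;\propto\; \underbrace{p_{\text{web}}(x)}_{\text{prior}} \;\cdot\; \underbrace{p(A \mid x)}_{\text{likelihood}\;\propto\;\exp(r(x)/\beta)},
\]

which is the same warped distribution as above. On this view \(\beta\) measures how much you trust the reward model relative to the prior: a perfect reward model would correspond to the limit \(\beta \to 0\), and keeping \(\beta > 0\) is exactly “not trusting it blindly”.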