I think I broadly disagree with this—using RLHF in a way that actually corresponds to some conditional of the original policy seems very important to me.
That said, it seems to me that in practice you often should be trying to converge to a zero entropy policy. The optimal action is not random; we want the model to be picking the output that looks best in light of its beliefs.
I think you probably expect that we’ll eventually be able to give good feedback even for quite complex tasks like “actually say the truth”—but if you don’t expect that to ever succeed (as I think I don’t), then staying close to existing data seems like a pretty important tool to prevent yourself from Goodharting on bad feedback.
It seems like the ideal way forward is to more accurately capture what you actually care about, and then optimize that.
What if what you actually care about is predicting the world well? In that case, staying close to existing data actually seems like a very principled thing to do. In particular, the KL penalty here lets you interpret the feedback as just extracting a particular conditioned distribution, which is precisely the thing I think you want to do with such a predictor.
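To spell that out (this is just the standard closed-form result for KL-regularized RL; the notation is mine): the policy that maximizes expected reward minus a KL penalty to the base model $\pi_0$,

$$\pi^* = \arg\max_\pi \; \mathbb{E}_{x \sim \pi}[r(x)] - \beta \, D_{\mathrm{KL}}(\pi \,\|\, \pi_0),$$

is a reweighting of the base distribution,

$$\pi^*(x) = \frac{1}{Z}\, \pi_0(x)\, \exp\!\big(r(x)/\beta\big), \qquad Z = \mathbb{E}_{x \sim \pi_0}\!\big[\exp\!\big(r(x)/\beta\big)\big].$$

In particular, if the feedback amounts to $r(x) = \beta \log p(o \mid x)$ for some evidence $o$, then $\pi^*(x) = \pi_0(x \mid o)$: exactly the base predictor conditioned on that evidence, rather than some new policy.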