I hesitantly disagree, because I think that releasing reward models will cause them to get over-optimized, and the backlash against that over-optimization may more than undo the positive effect on the world. I would suggest instead releasing the models that have been trained on those rewards, because, for example, Anthropic's models would give much better advice than td3.
I think I don't completely understand the objection? Is your concern that less competent organizations will over-fit to the reward models during fine-tuning, and so end up with worse models than OpenAI/Anthropic are able to train? I think this is a fair objection, and one argument for open-sourcing the full model.
My main goal with this post is to argue that it would be good to "at least" open-source the reward models, and that the benefits of doing so would far outweigh the costs, both for society and for the organizations doing the open-sourcing. I tend to think that completely unaligned models will get more backlash than imperfectly aligned ones, but maybe this is incorrect. I haven't thought deeply about whether it is safe/good for everyone to open-source the underlying capability model.
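For concreteness on the over-optimization worry: the standard mitigation during fine-tuning is a KL penalty that keeps the policy close to the reference model, so it can't exploit the reward model too aggressively. A minimal sketch of that objective (the function name and the beta value are just illustrative, not anyone's actual training code):

```python
def penalized_reward(reward_model_score, policy_logprobs, ref_logprobs, beta=0.02):
    """Illustrative KL-regularized reward of the kind used in RLHF fine-tuning.

    The raw reward-model score is discounted by how far the fine-tuned policy
    has drifted from the reference (pre-trained) policy; a larger beta limits
    how hard the policy can exploit quirks of the reward model.
    """
    # Per-token log-prob differences approximate the KL to the reference policy.
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward_model_score - beta * kl
```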
People are open-sourcing highly capable models already, but those models have not been trained on corpora that are aligned in any meaningful way. Releasing reward models would allow people to change the meaning of words in ways that cause the reward models to rank their words highly but which are actually dishonest; today this is called SEO. However, if getting your ideas repeated by a language model becomes the new target, then having the language model's objective be open source will mess with things. Idk mate, just my ramblings I guess.
I think it's a fair point that an open reward function would be subject to "SEO"-style efforts to game it. But how about a "training" reward function that is open, and a "test" reward function that is hidden?
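As a rough sketch of what that split could look like (purely illustrative; `open_rm` and `hidden_rm` stand in for hypothetical open and held-out reward models):

```python
def goodhart_gap(samples, open_rm, hidden_rm):
    """Estimate how much the open 'training' reward is being gamed.

    Scores each sample with both reward models; a large positive gap means
    outputs that the open reward model rates highly are rated poorly by the
    hidden one, i.e. the "SEO" failure mode discussed above.
    """
    gaps = [open_rm(s) - hidden_rm(s) for s in samples]
    return sum(gaps) / len(gaps)  # mean gap across samples
```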
I would love to know about other OSS efforts on reward functions (I do follow Carper's development on RF), and would love to contribute.