People are open-sourcing highly capable models already, but those models have not been trained on particularly meaning-aligned corpora. Releasing reward models would allow people to change the meaning of words in ways that cause the reward models to rank their words highly but which are actually dishonest; this is currently called SEO. However, if optimizing for getting your ideas repeated by a language model becomes the new target, then having your language model's objective be open source will mess with things. idk mate, just my ramblings I guess.
I think this is a fair point: an open reward function is subject to “SEO” efforts to game it. But how about a “training” reward function that is open and a “test” reward function that is hidden?
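A minimal sketch of what that split could look like, assuming you have two separately trained reward models: an open one that anyone can optimize against, and a hidden one held by evaluators. All names here (`open_reward`, `hidden_reward`, `score_gap`) are hypothetical placeholders, not an existing library's API.

```python
from typing import Callable, List


def score_gap(
    outputs: List[str],
    open_reward: Callable[[str], float],    # public reward model anyone can tune against
    hidden_reward: Callable[[str], float],  # held-out reward model kept private by evaluators
) -> float:
    """Average difference between open and hidden reward scores.

    A large positive gap suggests the outputs were "SEO'd" against the
    open reward function rather than genuinely improved.
    """
    gaps = [open_reward(o) - hidden_reward(o) for o in outputs]
    return sum(gaps) / len(gaps)
```

The usage idea would be: train the policy against the open reward function, then have whoever holds the hidden one report the gap on held-out prompts to flag gaming.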
I would love to know about other OSS efforts on reward functions (I do follow Carper’s development on RF), and I’d love to contribute.