Would you say Learning to Summarize is an example of this? https://arxiv.org/abs/2009.01325
It’s model-based RL because you’re optimizing against the model of the human (i.e., the reward model). And there are some results at the end on test-time search.
Or do you have something else in mind?