Appending a reward modelling system to GPT-2 directly has already been done: humans were asked to select from among GPT-2 outputs according to some criteria, a reward model was trained on those human selections, and that reward model was then used to fine-tune GPT-2. Based on what you’ve just said, this method is just a much faster, more efficient way of getting a GPT to adapt to perform a recurrent task (since it uses a reward model trained on a few examples of human evaluation, instead of waiting for GPT to adapt by itself to many human selections as you suggest).
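For concreteness, the reward model in that setup is just a scalar scorer fit to the human choices. A minimal sketch of the standard pairwise-preference loss, assuming a `reward_model` callable that returns one score per sample (the names and shapes here are illustrative, not the exact code used for GPT-2):

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen_batch, rejected_batch):
    """Push the reward of the human-selected output above the reward of the
    output the human passed over (pairwise Bradley-Terry style loss)."""
    r_chosen = reward_model(chosen_batch)      # shape: (batch,)
    r_rejected = reward_model(rejected_batch)  # shape: (batch,)
    # Model P(human prefers 'chosen') as sigmoid(r_chosen - r_rejected) and
    # minimise the negative log-likelihood of the observed selections.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The trained scorer then stands in for the human during RL fine-tuning, so each human selection gets reused across many policy updates.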
We have demonstrated RL fine-tuning of language models to four NLP tasks: stylistic continuation with high sentiment or physically descriptive language, and summarization on the CNN/Daily Mail and TL;DR datasets. Rather than building task-specific techniques, we achieve our results by straightforwardly applying reward learning to language generation. We extend previous reward learning work with pretrained models and KL regularization to prevent the policy from diverging too far from natural language.
Our results are mixed. On the continuation tasks we achieve good results vs. the zero-shot baseline as evaluated by humans with very few samples: 2.5k for sentiment and 5k for descriptiveness. However, for both summarization tasks our policies are only “smart copiers” (extractive rather than abstractive): they copy from the input text but skip over irrelevant preamble.
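For reference, the KL regularization mentioned here is, as far as I understand the paper, just a penalty folded into the reward that the policy is trained against, roughly:

```latex
% r = learned reward model, \pi = policy being fine-tuned,
% \rho = original pretrained language model, \beta = penalty strength
R(x, y) = r(x, y) - \beta \, \log \frac{\pi(y \mid x)}{\rho(y \mid x)}
```

so the fine-tuned policy is discouraged from drifting too far from what the pretrained model would naturally generate.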
No-one has done this reward modelling technique for GPT-3 yet, but it should be trivial since the exact method used for GPT-2 should work. The method notably didn’t work as well when used to improve GPT-2 output on more complicated tasks (good on sentiment biasing, mediocre on summarization), but that’s because GPT-2 wasn’t coherent enough over long enough ranges to properly earn reward from a reward model representing some complex task or concept. With GPT-3, you might be able to use the reward modelling method to get it to focus on more complicated concepts, or get it to be more ‘factually accurate and on-topic’. If you had the humans evaluate ‘accurate and on-topic’ and built up such a reward model, that might be a way to ‘bring out’ the knowledge GPT-3 has but sometimes doesn’t use. I think it would be just like this, but with the reward model helping you get more mileage out of each q/a pair in your buffer by generalising over it a bit:
Allow GPT to answer the next query.
Allow GPT to predict the evaluation.
If the evaluation returns as TRUE, append the q/a pair to a buffer.
If the buffer is large enough, append it to the context and repeat.
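A minimal sketch of that loop, with the reward model standing in for the per-answer human check; every function name here (generate_answer, reward_model_score) is a hypothetical placeholder rather than an existing API:

```python
def answer_with_buffer(queries, generate_answer, reward_model_score,
                       threshold=0.5, buffer_size=8):
    """Answer queries one at a time, keeping the q/a pairs the reward model
    judges as good and folding them back into the prompt as context."""
    context = ""   # accumulated q/a pairs prepended to every new prompt
    buffer = []    # pairs judged TRUE but not yet folded into the context

    for query in queries:
        # 1. Let GPT answer the next query, conditioned on prior good pairs.
        answer = generate_answer(context + query)

        # 2. Let the reward model (trained on human selections) predict the
        #    evaluation, e.g. how 'accurate and on-topic' the pair is.
        score = reward_model_score(query, answer)

        # 3. If the evaluation comes back TRUE, append the q/a pair to the buffer.
        if score >= threshold:
            buffer.append((query, answer))

        # 4. Once the buffer is large enough, append it to the context and repeat.
        if len(buffer) >= buffer_size:
            context += "".join(f"Q: {q}\nA: {a}\n" for q, a in buffer)
            buffer.clear()

        yield query, answer, score
```

The reward model is doing the work in step 2: instead of a human checking every single answer, a model trained on a few thousand human judgments generalises over them, which is where the extra mileage per q/a pair would come from.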
Perhaps you’d run into trouble needing a more sophisticated reward model to get much extra mileage out of each new query, but given that it already worked with GPT-2 on simple tasks, it might do well with GPT-3 on complex tasks. Essentially, it’s everything you said, except that we already have solid evidence that big parts of it can be automated and therefore likely achieved more quickly than would otherwise be expected.