Important correction: text-davinci-002 was probably not trained with RLHF, but with a “slightly different” method. I have not corrected the previous text of this post, but I’ve added a section at the beginning with further details on this update.
Btw. the confusion comes directly from OA’s paper (emphasis mine):
We focus on fine-tuning approaches to aligning language models. Specifically, we use reinforcement learning from human feedback (RLHF; Christiano et al., 2017; Stiennon et al., 2020) to fine-tune GPT-3 to follow a broad class of written instructions (see Figure 2). This technique uses human preferences as a reward signal to fine-tune our models. We first hire a team of 40 contractors to label our data, based on their performance on a screening test (see Section 3.4 and Appendix B.1 for more details). We then collect a dataset of human-written demonstrations of the desired output behavior on (mostly English) prompts submitted to the OpenAI API and some labeler-written prompts, and use this to train our supervised learning baselines. Next, we collect a dataset of human-labeled comparisons between outputs from our models on a larger set of API prompts. We then train a reward model (RM) on this dataset to predict which model output our labelers would prefer. Finally, we use this RM as a reward function and fine-tune our supervised learning baseline to maximize this reward using the PPO algorithm (Schulman et al., 2017). We illustrate this process in Figure 2. This procedure aligns the behavior of GPT-3 to the stated preferences of a specific group of people (mostly our labelers and researchers), rather than any broader notion of “human values”; we discuss this further in Section 5.2. We call the resulting models InstructGPT.
and their docs:
but below they state that the following models are trained using “FeedME, Supervised fine-tuning on human-written demonstrations and on model samples rated 7⁄7 by human labelers on an overall quality score”: text-davinci-001, text-davinci-002, text-curie-001, text-babbage-001.
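To make the difference concrete, here is a rough, purely illustrative sketch of the two recipes as I understand them from the paper and the docs. Everything in it (the function names sft_finetune, train_reward_model, ppo_finetune, and the data formats) is a hypothetical placeholder, not OpenAI’s actual code: RLHF adds a learned reward model and a PPO step on top of supervised fine-tuning, while FeedME stays entirely supervised and just adds model samples rated 7/7 to the fine-tuning set.

```python
# Illustrative pseudocode only: placeholder stubs, not OpenAI's training code.

def sft_finetune(model, examples):
    """Supervised fine-tuning on (prompt, completion) pairs."""
    ...  # placeholder: next-token-prediction training on the completions
    return model

def train_reward_model(model, comparisons):
    """Fit a reward model on human comparisons between model outputs."""
    ...  # placeholder: learn to score the preferred completion higher
    return lambda prompt, completion: 0.0  # placeholder reward function

def ppo_finetune(model, reward_fn, prompts):
    """Optimize the policy against the learned reward with PPO."""
    ...  # placeholder: sample completions, score them, apply PPO updates
    return model

def rlhf_pipeline(base_model, demonstrations, comparisons, prompts):
    # RLHF as described in the InstructGPT paper:
    # 1. SFT on human-written demonstrations,
    # 2. reward model trained on human-labeled comparisons,
    # 3. PPO against that reward model (the actual RL step).
    sft_model = sft_finetune(base_model, demonstrations)
    reward_fn = train_reward_model(sft_model, comparisons)
    return ppo_finetune(sft_model, reward_fn, prompts)

def feedme_pipeline(base_model, demonstrations, rated_samples):
    # FeedME as described in the docs: purely supervised fine-tuning on
    # human-written demonstrations plus model samples rated 7/7 by labelers.
    best_samples = [s for s in rated_samples if s["rating"] == 7]
    return sft_finetune(base_model, demonstrations + best_samples)
```

If the docs are right, text-davinci-002 went through something like the second path rather than the first.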