Important correction: text-davinci-002 was probably not trained with RLHF, but with a “slightly different” method. I have not corrected the previous text of this post, but I’ve added a section at the beginning with further details on this update.
Btw. the confusion comes directly from OA’s paper (emphasis mine):
We focus on fine-tuning approaches to aligning language models. Specifically, we use reinforcement learning from human feedback (RLHF; Christiano et al., 2017; Stiennon et al., 2020) to fine-tune GPT-3 to follow a broad class of written instructions (see Figure 2). This technique uses human preferences as a reward signal to fine-tune our models. We first hire a team of 40 contractors to label our data, based on their performance on a screening test (see Section 3.4 and Appendix B.1 for more details). We then collect a dataset of human-written demonstrations of the desired output behavior on (mostly English) prompts submitted to the OpenAI API and some labeler-written prompts, and use this to train our supervised learning baselines. Next, we collect a dataset of human-labeled comparisons between outputs from our models on a larger set of API prompts. We then train a reward model (RM) on this dataset to predict which model output our labelers would prefer. Finally, we use this RM as a reward function and fine-tune our supervised learning baseline to maximize this reward using the PPO algorithm (Schulman et al., 2017). We illustrate this process in Figure 2. This procedure aligns the behavior of GPT-3 to the stated preferences of a specific group of people (mostly our labelers and researchers), rather than any broader notion of “human values”; we discuss this further in Section 5.2. We call the resulting models InstructGPT.
and their docs:
but below they state that the following models are trained using “FeedME, Supervised fine-tuning on human-written demonstrations and on model samples rated 7⁄7 by human labelers on an overall quality score”: text-davinci-001, text-davinci-002, text-curie-001, text-babbage-001.
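To make the difference concrete, here is a rough, purely illustrative sketch of the two recipes as I understand them from the paper and the docs. Everything in it (the function names sft_finetune, train_reward_model, ppo_finetune, and the data formats) is a hypothetical placeholder, not OpenAI’s actual code: RLHF adds a learned reward model and a PPO step on top of supervised fine-tuning, while FeedME stays entirely supervised and just adds model samples rated 7/7 to the fine-tuning set.

```python
# Illustrative pseudocode only: placeholder stubs, not OpenAI's training code.

def sft_finetune(model, examples):
    """Supervised fine-tuning on (prompt, completion) pairs."""
    ...  # placeholder: next-token-prediction training on the completions
    return model

def train_reward_model(model, comparisons):
    """Fit a reward model on human comparisons between model outputs."""
    ...  # placeholder: learn to score the preferred completion higher
    return lambda prompt, completion: 0.0  # placeholder reward function

def ppo_finetune(model, reward_fn, prompts):
    """Optimize the policy against the learned reward with PPO."""
    ...  # placeholder: sample completions, score them, apply PPO updates
    return model

def rlhf_pipeline(base_model, demonstrations, comparisons, prompts):
    # RLHF as described in the InstructGPT paper:
    # 1. SFT on human-written demonstrations,
    # 2. reward model trained on human-labeled comparisons,
    # 3. PPO against that reward model (the actual RL step).
    sft_model = sft_finetune(base_model, demonstrations)
    reward_fn = train_reward_model(sft_model, comparisons)
    return ppo_finetune(sft_model, reward_fn, prompts)

def feedme_pipeline(base_model, demonstrations, rated_samples):
    # FeedME as described in the docs: purely supervised fine-tuning on
    # human-written demonstrations plus model samples rated 7/7 by labelers.
    best_samples = [s for s in rated_samples if s["rating"] == 7]
    return sft_finetune(base_model, demonstrations + best_samples)
```

If the docs are right, text-davinci-002 went through something like the second path rather than the first.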