Publicly available information suggests that the mystery method may not be so different from RLHF.
In fact, according to new documentation released in late November 2022, text-davinci-001 and text-davinci-002 are trained with supervised fine-tuning, with no apparent use of reinforcement learning:
FeedME: Supervised fine-tuning on human-written demonstrations and on model samples rated 7/7 by human labelers on an overall quality score
The SFT and PPO models are trained similarly to the ones from the InstructGPT paper. FeedME (short for “feedback made easy”) models are trained by distilling the best completions from all of our models.
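To make the distinction concrete, here is a minimal sketch of what FeedME-style training amounts to under that description: sample completions, keep only those a human labeler rates 7/7 on overall quality, and run ordinary supervised fine-tuning on the survivors. This is an illustration, not OpenAI's actual pipeline; the model choice (gpt2) and the `rate_completion` stub are placeholders for details that are not public.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def rate_completion(prompt: str, completion: str) -> int:
    """Stand-in for a human labeler's 1-7 overall quality score."""
    return 7  # placeholder: real scores come from human labelers

prompts = ["Explain how a telescope works."]

# Collect model samples and keep only the top-rated ones.
dataset = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    sample_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True)
    completion = tokenizer.decode(sample_ids[0], skip_special_tokens=True)
    if rate_completion(prompt, completion) == 7:
        dataset.append(completion)

# Ordinary next-token cross-entropy on the filtered samples (plus any
# human-written demonstrations, omitted here). No reward model and no
# PPO step: the feedback only decides which text gets trained on.
model.train()
for text in dataset:
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```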
Only text-davinci-003 is trained with “reinforcement learning with reward models trained from comparisons by humans”, and it was released after this post was written.
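By contrast, the comparison-based approach cited for text-davinci-003 (as in the InstructGPT paper) first trains a reward model so that completions labelers preferred score higher than the ones they rejected, and then optimizes the policy against that reward with RL such as PPO. The sketch below shows only the reward-model step with a toy scorer; the architecture and data are illustrative assumptions, and the PPO stage is omitted.

```python
import torch
import torch.nn.functional as F

class TinyRewardModel(torch.nn.Module):
    """Toy scorer that maps a token sequence to a scalar reward."""
    def __init__(self, vocab_size=50257, dim=64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.score = torch.nn.Linear(dim, 1)

    def forward(self, token_ids):                    # (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)   # mean-pool embeddings
        return self.score(pooled).squeeze(-1)        # one reward per sequence

reward_model = TinyRewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# One toy comparison: the labeler preferred `chosen` over `rejected`.
chosen = torch.randint(0, 50257, (1, 16))
rejected = torch.randint(0, 50257, (1, 16))

# Pairwise preference loss: push the preferred completion's reward above
# the rejected one's, i.e. maximize log sigmoid(r_chosen - r_rejected).
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()

# A separate RL stage (e.g. PPO) would then fine-tune the language model
# to maximize this learned reward -- the step that FeedME models skip.
```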