Update to Mysteries of mode collapse: text-davinci-002 not RLHF
I (and many others) did not realize this before, but text-davinci-002 and text-davinci-001, the InstructGPT models on the OpenAI API, were not trained with RLHF (reinforcement learning from human feedback) as described in the InstructGPT paper, but with a “similar but slightly different”[1] method that uses the same human feedback data. Apparently, this other method is not technically RLHF.
Since this update has potentially nontrivial implications for interpreting the phenomena exhibited by text-davinci-002 described in Mysteries of mode collapse (formerly titled “Mysteries of mode collapse due to RLHF”), I’m making this separate post for a signal boost.
I have not corrected the original text of “Mysteries of mode collapse due to RLHF”, but I’ve added a section at the beginning with further details on this update, copied here:
I have received evidence from multiple credible sources that text-davinci-002 was not trained with RLHF.
The rest of this post has not been corrected to reflect this update. Not much besides the title (formerly “Mysteries of mode collapse due to RLHF”) is affected: just mentally substitute “mystery method” every time “RLHF” is invoked as the training method of text-davinci-002. The observations of its behavior otherwise stand alone.
This is kind of fascinating from an epistemological standpoint. I was quite surprised to learn that text-davinci-002 was probably not trained with RLHF. I don’t remember exactly how “text-davinci-002 is RLHF” got elevated to an unquestioned assumption in my mind. I might have mistaken not being contradicted by people who I assumed were in the know for confirmation. I certainly did not expect to talk for months to dozens of people about odd behaviors I’ve observed in a well-known model “due to RLHF” without being contradicted in a world where the model in question wasn’t trained with RLHF, but that’s what happened.[2] It wasn’t just me either: the assumption that text-davinci-002 (/text-davinci-001) is InstructGPT is RLHF seems ambient (e.g. search “text-davinci-002 rlhf” on Twitter, this LW post, this article, and many others). I contributed to perpetuating this misinformation cascade, and for that I apologize.
text-davinci-002’s behaviors described in this post also contributed to my confidence, because RLHF seemed to be a likely and potentially satisfying explanation. Its apparently unsubstantiated confidence in very specific outcomes seems antithetical to the outer objective of self-supervised learning, which is optimized by epistemic calibration, meaning the model’s entropy should be as high as possible while fitting the data. In contrast, as several comments have pointed out, it makes sense that RL kills entropy. The presence of “attractors” made me additionally suspect that optimization from non-myopic outcome-supervision was formative to text-davinci-002’s psyche.
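(To spell out the calibration point as a generic identity, nothing specific to these models: the expected next-token loss under the data distribution decomposes as

$$\mathbb{E}_{x \sim p_{\text{data}}}\left[-\log q_\theta(x)\right] = H(p_{\text{data}}) + D_{\mathrm{KL}}\left(p_{\text{data}} \,\|\, q_\theta\right),$$

which is minimized exactly when $q_\theta = p_{\text{data}}$. A perfectly optimized predictor matches the data distribution’s entropy; it gains nothing from concentrating probability on a few favored continuations.)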
Mode collapse and attractors do seem to also be caused by RLHF (see Dumbass policy pls halp and Inescapable wedding parties). So the update is that some other training method also gives rise to these phenomena, as they are manifested by text-davinci-002.

Whether and how speculations concerning the causes of mode collapse/attractors should be affected depends on how text-davinci-002’s training method differs from RLHF.

What is known about text-davinci-002’s training method

Publicly available information suggests that the mystery method may not be so different from RLHF. Just today I discovered this sidenote in OpenAI’s blog post Aligning Language Models to Follow Instructions:
The InstructGPT models deployed in the API are updated versions trained using the same human feedback data. They use a similar but slightly different training method that we will describe in a forthcoming publication.
AFAIK, this is all that OpenAI has published about the RLHF/mystery method diff. It says that the InstructGPT models (text-davinci-001 and text-davinci-002) were trained using the same human feedback data as the method described in OpenAI’s RLHF paper.[3] But this “similar but slightly different” method is apparently sufficiently different to not qualify as RLHF!

Pending further revelations, I suppose the lesson here was that I should have sustained more entropy in my belief state given the partial information I had. But what a demanding thing to ask! So much easier to promote an attractive hypothesis to the status of decisive fact and collapse the remainder than to hold a superposition in the mind.
Footnotes

[1] Sidenote in OpenAI’s blog post, Aligning Language Models to Follow Instructions.

[2] The lack of epistemic vigilantes attacking an unsubstantiated assumption in the very title of this post on LessWrong is truly unbelievable!

[3] Which seems to confirm my suspicion about outcome-supervision.
OpenAI has just released a description of how their models work here.
text-davinci-002 is trained with “FeedME” and text-davinci-003 is trained with RLHF (PPO).
“FeedME” is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7⁄7 by human labelers. So basically fine-tuning on high-quality data.
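To make the description concrete, here is a minimal sketch of the data-selection step as I read it (the records and field names are invented, and this is not OpenAI’s actual pipeline):

```python
import json

# Hypothetical rated samples: in the FeedME description these would be human
# demonstrations or model outputs scored by human labelers on a 1-7 scale.
# Records and field names here are invented for illustration.
records = [
    {"prompt": "Explain photosynthesis.", "completion": "Plants convert light energy ...", "rating": 7},
    {"prompt": "Explain photosynthesis.", "completion": "idk, look it up", "rating": 2},
]

# FeedME-style selection: keep only top-rated (7/7) samples, then run ordinary
# supervised fine-tuning on them. No reward model, no RL step; the "feedback"
# only enters through which samples make it into the dataset.
best = [r for r in records if r["rating"] == 7]

with open("feedme_sft.jsonl", "w") as f:
    for r in best:
        f.write(json.dumps({"prompt": r["prompt"], "completion": r["completion"]}) + "\n")
```

The point is that the optimization itself is plain likelihood training on a curated dataset; the human feedback only shapes which samples end up in that dataset.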
I think your findings are still very interesting, because they imply that even further finetuning changes the distribution significantly! Given all this information, one could now actually run a systematic comparison of davinci (base model), text-davinci-002 (finetuning), and text-davinci-003 (RLHF) and see how the distribution changes on various tasks.
Let me know if you want help on this, I’m interested in this myself.
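As a rough sketch of what such a comparison could look like, assuming the legacy openai Python package’s Completion endpoint and a placeholder probe prompt (a real experiment would use a proper task suite):

```python
import os
import openai  # legacy (pre-1.0) openai package

openai.api_key = os.environ["OPENAI_API_KEY"]

MODELS = ["davinci", "text-davinci-002", "text-davinci-003"]
PROMPT = "Tell me a story about a wedding.\n"  # placeholder probe prompt

# Compare how peaked each model's next-token distribution is on the same prompt.
# The API only returns the top 5 logprobs, so this is a coarse probe rather
# than a full distributional comparison.
for model in MODELS:
    resp = openai.Completion.create(
        model=model,
        prompt=PROMPT,
        max_tokens=1,
        temperature=0,
        logprobs=5,
    )
    top = resp["choices"][0]["logprobs"]["top_logprobs"][0]
    ranked = sorted(top.items(), key=lambda kv: -kv[1])
    print(model, [(tok, round(lp, 2)) for tok, lp in ranked])
```

Repeating this over many prompts, and over whole continuations rather than single tokens, would give a crude picture of how entropy differs between the base, FeedME, and RLHF models.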
In this thread, I asked Jan Leike what kind of model generates the samples that go into the training data if rated 7⁄7, and he answered “A mix of previously trained models. Probably very few samples from base models if any” (emphasis mine).
I’m curious to know whether/which of the behaviors described in this post appear in the models that generated the samples vs emerge at the supervised finetuning step.
Hypothetically, if a model trained with RLHF generates the samples and that model has the same modes/attractors, it probably makes sense to say that RLHF was responsible for shaping those behaviors and finetuning only “cloned” them.
(Note that it wasn’t specified how the previously trained non-base models were actually trained, which leaves open the possibility of RLHF models, models fine-tuned only on human data, earlier iterations of FeedME models, or something entirely different.)
That’s interesting!
Yeah, I agree with that assessment. One important difference between RLHF and fine-tuning is that the former basically generates the training distribution it then trains on: the LM generates an output and its weights are updated based on the reward of that output. So intuitively I think it has a higher likelihood of being optimized towards certain unwanted attractors, since the reward model shapes the future outputs it then learns from.
With fine-tuning you are just cloning a fixed distribution, not influencing it (as you say). So I tend to agree that unwanted attractors could well be due to the outputs of RLHF-trained models. I think we need empirical evidence for this, though, to be certain.
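To make that difference concrete, here is a toy, runnable caricature (no PPO, no KL penalty; the completions and rewards are invented): an on-policy loop that reinforces its own high-reward samples collapses onto them, while supervised fine-tuning just matches whatever fixed dataset it is given.

```python
import math
import random
from collections import Counter

random.seed(0)

# Tiny "vocabulary" of whole completions, purely for illustration.
COMPLETIONS = ["A", "B", "C", "D"]

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def toy_rl(steps=2000, lr=0.05):
    """Schematic on-policy loop: sample from the current policy and reinforce
    high-reward outputs. The training distribution drifts with the policy
    itself, and entropy collapses onto the top-reward completion."""
    weights = [1.0, 1.0, 1.0, 1.0]
    reward = {"A": 1.0, "B": 0.2, "C": 0.1, "D": 0.0}  # invented reward model
    for _ in range(steps):
        probs = normalize(weights)
        out = random.choices(COMPLETIONS, weights=probs, k=1)[0]
        weights[COMPLETIONS.index(out)] *= math.exp(lr * reward[out])
    return normalize(weights)

def toy_sft(dataset, smoothing=1e-3):
    """Schematic supervised fine-tuning: clone a fixed dataset (e.g.
    demonstrations or 7/7-rated samples); the model has no influence on
    the distribution it learns from."""
    counts = Counter(dataset)
    return normalize([counts[c] + smoothing for c in COMPLETIONS])

if __name__ == "__main__":
    rl_probs = toy_rl()
    sft_probs = toy_sft(["A"] * 50 + ["B"] * 30 + ["C"] * 20)
    print("RL-style: ", [round(p, 3) for p in rl_probs], "entropy", round(entropy(rl_probs), 3))
    print("SFT-style:", [round(p, 3) for p in sft_probs], "entropy", round(entropy(sft_probs), 3))
```

Real RLHF setups add a KL penalty against the base/SFT model precisely to limit this kind of collapse, so this only shows the direction of the incentive, not its magnitude.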
Given your statement, I also think that doing those experiments with GPT-3 models is gonna be hard, because we basically have no way of telling what data they learned from, how it was generated, etc. So one would need to be more scientific and train various models with various optimization schemes on known data distributions.
Very useful update, thanks.
Though I notice they don’t say anything about how ada and text-ada-* models were trained.
Thanks for catching this and spreading the word!
Do we know if the following other models from OpenAI use true RLHF or also use this RLHF-like mystery method? (or something else!)
text-curie-001
text-babbage-001
text-ada-001
The new model index from OpenAI contains most of the answers to this. Jérémy linked to it in another comment on this post. However, the model index doesn’t give info on ada and text-ada-001 yet: https://beta.openai.com/docs/model-index-for-researchers
I don’t know :(
There was a recent Twitter thread about this. See here and here.