Models that have been RLHF’d (so to speak), have different world priors in ways that aren’t really all that intuitive (see Janus’ work on mode collapse
Janus’ post on mode collapse is about text-davinci-002, which was trained using supervised fine-tuning on high-quality human-written examples (FeedME), not RLHF. It’s evidence that supervised fine-tuning can lead to weird output, not evidence about what RLHF does.
I haven’t seen evidence that RLHF’d text-davinci-003 appears less safe compared to the imitation-based text-davinci-002.
Janus’ post on mode collapse is about text-davinci-002, which was trained using supervised fine-tuning on high-quality human-written examples (FeedME), not RLHF. It’s evidence that supervised fine-tuning can lead to weird output, not evidence about what RLHF does.
I haven’t seen evidence that RLHF’d
text-davinci-003
appears less safe compared to the imitation-basedtext-davinci-002
.Refer my other reply here. And as the post mentions, RLHF also does exhibit mode collapse (check the section on prior work).