I don’t get the impression that RLHF needs hacks to prevent mode collapse: the InstructGPT paper reports overfitting leading to better human-rater feedback, and the Anthropic HH paper mentions in passing that the KL penalty may be wholly irrelevant (!).
But IIRC, doesn’t OA also mention that to get better results they had to mix in continued training on the original pretraining data (the ‘PPO-ptx’ objective)? That serves much the same regularizing purpose as a KL penalty. (I don’t recall the RL-training part of the Anthropic HH paper, just the prompted-agent parts.)
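(For concreteness, here’s a minimal, hypothetical sketch of the two regularizers being compared; the function names and tensor shapes are mine, not OpenAI’s code, and the coefficients are only in the ballpark of what the InstructGPT paper reports: (a) a per-token KL penalty toward the frozen SFT policy folded into the reward, vs. (b) the PPO-ptx trick of adding a language-modeling loss on raw pretraining text to the PPO objective.)

```python
import torch.nn.functional as F

def kl_penalized_reward(reward, logprob_policy, logprob_sft, beta=0.02):
    """(a) KL-style regularizer: shape the RL reward as
    r' = r - beta * (log pi_RL(y|x) - log pi_SFT(y|x)),
    penalizing drift away from the frozen SFT model."""
    return reward - beta * (logprob_policy - logprob_sft)

def ppo_ptx_objective(ppo_loss, pretrain_logits, pretrain_tokens, gamma=27.8):
    """(b) PPO-ptx-style regularizer: keep doing ordinary language modeling
    on raw pretraining text alongside PPO, weighted by gamma."""
    lm_loss = F.cross_entropy(
        pretrain_logits.view(-1, pretrain_logits.size(-1)),  # (batch*seq, vocab)
        pretrain_tokens.view(-1),                             # (batch*seq,)
    )
    return ppo_loss + gamma * lm_loss
```

Both terms drag the policy back toward its pretrained/SFT behavior, which is why (b) can plausibly stand in for (a).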
You suggest that td-003 mode-collapses where td-002 is perfectly capable. So you believe that both td-002 and td-003 mode-collapse, just in disjoint cases (given the examples from the original mode-collapse post)?
The original Janus post only covers -002, not -003. So the claim that 002 was RLHF-trained initially surprised me a bit, because at least working with samples, 003 seemed much more mode-collapsed than 002, and 002 much more mode-collapsed than davinci (though that might not be a fair comparison; I haven’t worked with code-davinci-002 enough to have an opinion about it, so that’s just my impression: in terms of diversity, davinci > 002 > 003). I didn’t expect 003 to be much worse than 002 if they were trained about the same, just with updated datasets or something; I figured that perhaps 003 was just much ‘more so’, the effect of training longer, or something uninteresting like that. The later discussion of 002 being closer to instruction-tuning than RLHF seemed to resolve that anomaly for me: if 003 is both RLHF’d & instruction-tuned, then of course it might be much more collapsey than 002.
But 003 being more collapsey doesn’t mean 002 isn’t collapsed at all. It’s just… less ‘stubborn’ about it? Yeah, it’ll give you the mediocrity by default, but it’s much easier to prompt or few-shot it into desired behavior than 003, so I barely even notice it.
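(Illustrative only: a sketch of what I mean by ‘prompt or few-shot it into desired behavior’, using the legacy Completions endpoint these models were served through; the prompt and sampling parameters are made up, and the point is just that one or two in-context examples are enough to steer 002 out of its default register.)

```python
import openai  # legacy pre-1.0 openai-python interface

few_shot_prompt = """Write a poem in blank verse (no rhyme) about the sea.

The grey swell heaves against the harbour wall,
indifferent to the gulls that ride its back,
and salt dries white along the mooring ropes.

Write a poem in blank verse (no rhyme) about winter.

"""

# A couple of in-context examples steer the model away from the
# default rhyming-couplet mode it falls into given a bare request.
response = openai.Completion.create(
    model="text-davinci-002",
    prompt=few_shot_prompt,
    max_tokens=200,
    temperature=1.0,
)
print(response["choices"][0]["text"])
```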
My initial impressions of the GPT-4 chat mode (which I have been using via the API/Playground rather than the ChatGPT interface*, and which is apparently RLHFed or something) are similar: I’m much happier with my poetry samples from GPT-4, because while the default poetry output is again repugnantly reminiscent of ChatGPT-3.5’s poetry (albeit much better, of course), I can easily prompt it into what I want. It still seems to subtly keep falling into rhyming couplets if it can, but it’s not incorrigible the way ChatGPT-3.5 is. EDIT: my optimism here proved misplaced. I’m not sure whether the continual updates to ChatGPT-4 destroyed the poetry quality, or whether I was simply initially over-impressed by the overall quality jump in LLMs and have since become much more disgusted with the collapseyness of ChatGPT-4.
* I don’t think this should make a difference but I mention it anyway.
That’s true; I think the choice to mix in pretraining gradients probably has more effect on the end model than the overfit SFT model they start PPO from.
Huh, but Mysteries of mode collapse (and the update) were published before td-003 was released? How would you have ended up reading a post claiming td-002 was RLHF-trained at a time when td-003 already existed?
Meta note: it’s plausibly net positive that all the training details of these models have been obfuscated, but it’s frustrating how much energy has been sunk into speculation on The Way Things Work Inside OpenAI.
I was never a fan of this perspective, and I still think everything should have been transparent from the beginning.