habryka comments on Thoughts on the impact of RLHF research

habryka 20 Feb 2023 21:37 UTC
LW: 9 AF: 5
5
> I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I’ve seen head-to-head comparisons suggesting real but modest effects on similar tasks).
Ok, I think we might now have some additional data on this debate. It does indeed look to me like Sydney was trained, for a few months, with the next-best available technology after RLHF, at least based on Gwern’s guesses here: https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned?commentId=AAC8jKeDp6xqsZK2K
As far as I can tell, this resulted in a system with much worse economic viability than ChatGPT. I would overall describe Sydney as “economically unviable”, such that if Gwern’s story here is correct, the difference between using straightforward supervised training on chat transcripts and OpenAI’s RLHF pipeline is indeed the difference between an economically viable and an unviable product.
There is a chance that Microsoft fixes this with more supervised training, but my current prediction is that they will have to fix this with RLHF, because the other technological alternatives are not adequate substitutes from an economic viability perspective, which suggests that the development of RLHF really did matter a lot for this.
- gwern 20 Feb 2023 21:47 UTC
  LW: 8 AF: 5
  3
  Benchmarking on static datasets on ordinary tasks (typically not even adversarially collected in the first place) may not be a good way to extrapolate to differences in level of abuse for PR-sensitive actors like megacorps, especially for abusers that are attacking the retrieval functionality (as Sydney users explicitly were trying to populate Bing hits to steer Sydney), a functionality not involved in said benchmarking at all. Or to put it another way, the fact that text-davinci-003 does only a little better than text-davinci-002 in terms of accuracy % may tell you little about how profitable in $ each will be once 4chan & the coomers get their hands on it… It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-distribution robustness on worst-case adversary-induced decision-relevant losses, in much the same way that model perplexity tells us little about what a model is useful for or how vulnerable it is.
  - habryka 20 Feb 2023 22:56 UTC
    LW: 3 AF: 2
    0
    Yeah, this is basically my point. Not sure whether you are agreeing or disagreeing. I was specifically quoting Paul’s comment saying “I’ve seen only modest qualitative differences” in order to disagree and say “I think we’ve now seen substantial qualitative differences”.
    We have had 4chan play around with ChatGPT for a while, with much less disastrous results than what happened when they got access to Sydney.
    > It is not news to anyone here that average-case performance on proxy metrics on some tame canned datasets may be unrelated to out-of-distribution robustness on worst-case adversary-induced decision-relevant losses, in much the same way that model perplexity tells us little about what a model is useful for or how vulnerable it is.
    I wish it were true that this is not news to anyone here, but that does not currently seem true to me. It doesn’t seem worth going into, though.
    - gwern 21 Feb 2023 1:23 UTC
      LW: 5 AF: 4
      3
      I was elaborating in more ML-y jargon, and also highlighting that there are a lot of wildcards omitted from Paul’s comparison: retrieval especially was an interesting dynamic.
- LawrenceC 20 Feb 2023 21:46 UTC
  LW: 8 AF: 4
  2
  For what it’s worth, I buy the claim from Gwern that Microsoft trained Sydney pretty poorly, much worse than is achievable with SFT on highly rated data. For example, Sydney shows significant repetition, which you don’t see even on text-davinci-002 or (early 2022) LaMDA, both trained without RLHF.
  - habryka 20 Feb 2023 22:58 UTC
    LW: 4 AF: 2
    0
    Yep, I think it’s pretty plausible this is just a data-quality issue, though I find myself somewhat skeptical of this. Maybe worth a bet?
    I would be happy to bet that, conditional on them trying to solve this with more supervised training and no RLHF, we are going to see error modes substantially more catastrophic than current ChatGPT.