Dave Orr comments on Will we run out of ML data? Evidence from projecting dataset size trends

Dave Orr 14 Nov 2022 17:18 UTC
26 points
5
I think there’s another, related, but much worse problem.
As LLMs become more widely adopted, they will generate large amounts of text on the internet. This widely available text will become training data for future LLMs. Tons of low-quality content will reinforce LLM proclivities to produce low-quality content. Even if LLMs are generating high-quality content, training on it will reinforce whatever tendencies and oddities they have, e.g. leaving them permanently pegged to the styles and topics of interest in 2010-2030.
This was a problem for translation. As Google Translate got better, people started posting translated versions of their website where the translation was from Google. Then scrapers looking for parallel data to train on would find these, and it took a lot of effort to screen them out.
Accepting your estimates at face value, there are two problems: the availability of good training data may be a limiting factor, and good training data will be hard to find in a sea of computer-generated content.
- Tamay 14 Nov 2022 18:49 UTC
  8 points
  5
  Parent
  If the data is low-quality and easily distinguishable from human-generated text, it should be simple to train a classifier to spot LM-generated text and exclude it from the training set. If it’s not possible to distinguish, then it should be high enough in quality that including it is not a problem.
  
  ETA: As people point out below, this comment was glib and glosses over some key details; I don’t endorse this take anymore.
  - Lech Mazur 14 Nov 2022 19:13 UTC
    13 points
    11
    Parent
    Generated data can be low quality but indistinguishable. Unless your classifier has access to more data or is better in some other way (e.g. larger, better architecture), you won’t know. In fact, if you could know without labeling generated data, why would you generate something that you can tell is bad in the first place? I’ve seen this in practice in my own project.
    - tgb 15 Nov 2022 1:57 UTC
      2 points
      0
      Parent
      Surely the problem is that someone else is generating it, or more accurately, that lots of other people are generating it in huge quantities.
      - Lech Mazur 15 Nov 2022 3:08 UTC
        1 point
        0
        Parent
        While there are still significant improvements in data/model/generation, you might be able to imperfectly detect whether some text was generated by the previous generation of models. But if you’re training a new model, you probably don’t have such a next-gen classifier ready yet. So if you want to do just one training run, it could be easier to limit your training data to text that was available years ago, or to trust only some sources.
        
        A related issue is the use of AI writing assistants that fix grammar and modify human-written text in other ways that the language model considers better. While it seems like a less important problem, they could make the human-written text somewhat harder to distinguish from the AI-written text from the other direction.
        - tgb 15 Nov 2022 13:03 UTC
          2 points
          0
          Parent
          I agree, I was just responding to your penultimate sentence: “In fact, if you could know without labeling generated data, why would you generate something that you can tell is bad in the first place?”
          Personally, I think it’s kind of exciting to be part of what might be the last breath of purely human writing. Also, depressing.
    - mocny-chlapik 15 Nov 2022 12:53 UTC
      0 points
      −4
      Parent
      In general, LM-generated text is still easily distinguishable by other LMs. Even though we humans cannot tell the difference, the way LMs generate text is not really human-like. They are much more predictable, simply because they are not trying to convey information as humans do; they are guessing the most probable sequence of tokens.
      Humans are less predictable because they always have something new to say. LMs, on the other hand, are like the most clichéd person ever.
      - justinpombrio 15 Nov 2022 14:13 UTC
        3 points
        0
        Parent
        You should try turning the temperature up.
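
For readers unfamiliar with the temperature knob mentioned above, here is a minimal sketch of how it changes predictability. It assumes the common scheme of dividing next-token scores by a temperature before applying softmax. The scores in `toy_logits` and the function names are made up for illustration and do not come from any particular model or library.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits: a rough measure of how unpredictable the next token is."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical next-token scores for four candidate tokens (made-up numbers).
toy_logits = [4.0, 2.0, 1.0, 0.5]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: top-token prob={max(probs):.2f}, entropy={entropy_bits(probs):.2f} bits")
```

At a low temperature nearly all the probability sits on the single most likely token, which is the "most probable sequence" behaviour described above. Raising the temperature spreads probability over the alternatives, so sampled text becomes less predictable.
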
- Tenoke 15 Nov 2022 23:04 UTC
  7 points
  2
  Parent
  There is little reason to think that’s a big issue. A lot of data is semi-tagged, so some of the ML-generated data can be removed either that way or by being detected by newer models. And in general, as long as the ‘good’ type of data is also increasing, model quality will keep increasing even if you have some extra noise.
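
Several of the filtering ideas raised in this thread (dropping tagged machine-generated text, trusting only some sources, keeping only text published before a cutoff) can be combined into a simple rule-based filter. The sketch below is only an illustration under stated assumptions: the `Document` fields, the cutoff date, and the allowlist are all hypothetical, and a real pipeline would need far more careful metadata.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Document:
    text: str
    source: str                             # hypothetical: domain the text was scraped from
    published: Optional[date]               # hypothetical: publication date, if known
    tagged_machine_generated: bool = False  # hypothetical: tag supplied by the site


CUTOFF = date(2020, 1, 1)          # assumed cutoff, chosen only for illustration
TRUSTED_SOURCES = {"example.org"}  # assumed allowlist of trusted domains


def keep_for_training(doc: Document) -> bool:
    """Keep a document if it is untagged and either trusted or published before the cutoff."""
    if doc.tagged_machine_generated:
        return False
    if doc.source in TRUSTED_SOURCES:
        return True
    return doc.published is not None and doc.published < CUTOFF


corpus = [
    Document("old blog post", "blog.example.com", date(2015, 6, 1)),
    Document("recent trusted article", "example.org", date(2022, 11, 1)),
    Document("recent untrusted page", "spam.example.net", date(2022, 11, 1)),
    Document("tagged translation", "example.org", date(2012, 1, 1), True),
    Document("undated page", "unknown.example.net", None),
]

kept = [d for d in corpus if keep_for_training(d)]
print([d.text for d in kept])
```

A filter like this trades recall for precision: it discards recent human-written text from unknown sources, and it does nothing about untagged generated text on trusted sites.
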
- sanxiyn 15 Nov 2022 1:51 UTC
  4 points
  0
  Parent
  “Then scrapers looking for parallel data to train on would find these, and it took a lot of effort to screen them out.”
  How did Google screen them out? Is there a paper published on this? It seems potentially important.
  - Dave Orr 18 Nov 2022 20:22 UTC
    3 points
    0
    Parent
    Er, I’m not sure it’s been published so I guess I shouldn’t give details. It had to be an automatic solution because human curation couldn’t scale to the size of the problem.
  - Ericf 16 Nov 2022 19:13 UTC
    1 point
    0
    Parent
    Guessing that it involved a human checking the website and/or sending an email to find out if the author had used a translation tool.