Generated data can be low quality but indistinguishable. Unless your classifier has access to more data or is better in some other way (e.g., larger, or with a better architecture), you won’t know. In fact, if you could know without labeling generated data, why would you generate something that you can tell is bad in the first place? I’ve seen this in practice in my own project.
Surely the problem is that someone else is generating it, or more accurately that lots of other people are generating it in huge quantities.
While there are still significant improvements in data, models, and generation methods, you might be able to imperfectly detect whether some text was generated by the previous generation of models. But if you’re training a new model, you probably don’t have such a next-gen classifier ready yet. So if you want to do just one training run, it could be easier to limit your training data to text that was available years ago, or to trust only certain sources.
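To make the cutoff idea concrete, here is a minimal sketch of that kind of filter; the field names, the cutoff date, and the trusted-source list are all hypothetical placeholders, not a recipe:

```python
# A minimal sketch of the date/source cutoff. Field names, the cutoff
# date, and the trusted-source list are all hypothetical placeholders.
from datetime import date

CUTOFF = date(2020, 1, 1)                 # before large-scale LM output
TRUSTED = {"arxiv.org", "gutenberg.org"}  # sources vetted some other way

def keep(doc):
    # Keep a document if it predates the cutoff or comes from a trusted source.
    return doc["crawled"] < CUTOFF or doc["source"] in TRUSTED

corpus = [
    {"source": "arxiv.org", "crawled": date(2023, 5, 1), "text": "..."},
    {"source": "blog.example", "crawled": date(2023, 5, 1), "text": "..."},
    {"source": "blog.example", "crawled": date(2016, 2, 3), "text": "..."},
]
training_data = [d for d in corpus if keep(d)]  # drops only the middle doc
```

The obvious cost is that the corpus stops growing, which is exactly the trade-off of trusting only pre-LM text.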
A related issue is the use of AI writing assistants that fix grammar and otherwise modify human-written text in ways the language model considers better. While this seems like a less important problem, such assistants could make human-written text somewhat harder to distinguish from AI-written text, blurring the line from the other direction.
I agree, I was just responding to your penultimate sentence: “In fact, if you could know without labeling generated data, why would you generate something that you can tell is bad in the first place?”
Personally, I think it’s kind of exciting to be part of what might be the last breath of purely human writing. Also, depressing.
In general, LM-generated text is still easily distinguishable by other LMs. Even though we humans cannot tell the difference, the way they generate text is not really human-like. They are much more predictable, simply because they are not trying to convey information as humans do; they are guessing the most probable sequence of tokens.
Humans are less predictable because they always have something new to say; LMs, on the other hand, are like the most cliché person ever.
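This predictability claim is essentially what perplexity-based detectors measure: score the text under some reference LM and flag it if it is too predictable. A minimal sketch, assuming a GPT-2-style model from the Hugging Face transformers library; the threshold is hypothetical and would need calibration on labeled data:

```python
# A minimal sketch of perplexity-based detection, assuming a GPT-2-style
# model from the Hugging Face transformers library. The threshold below is
# hypothetical; a real detector would calibrate it on labeled examples.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Average negative log-likelihood per token, exponentiated.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

sample = "The results of the study were published in a peer-reviewed journal."
# Lower perplexity means more predictable text, which (by the argument
# above) is weak evidence of machine generation.
print("suspect" if perplexity(sample) < 20.0 else "plausibly human")
```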
You should try turning the temperature up.
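For what it’s worth, temperature is just a divisor applied to the logits before the softmax, so turning it up flattens the next-token distribution and makes the output less predictable. A quick illustrative sketch with made-up logits:

```python
# A minimal sketch of temperature sampling over made-up next-token logits.
# Higher temperature flattens the distribution (less predictable output);
# lower temperature sharpens it toward the single most likely token.
import numpy as np

def sample_token(logits, temperature, rng):
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, 0.1])  # hypothetical logits for 4 tokens
for t in (0.2, 1.0, 2.0):
    counts = np.bincount(
        [sample_token(logits, t, rng) for _ in range(1000)], minlength=4
    )
    print(f"T={t}: {counts / 1000}")  # empirical token frequencies
```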