tgb comments on Will we run out of ML data? Evidence from projecting dataset size trends

tgb 15 Nov 2022 1:57 UTC
2 points
Surely the problem is that someone else is generating it, or more accurately, that lots of other people are generating it in huge quantities.
- Lech Mazur 15 Nov 2022 3:08 UTC
  1 point
  While there are still significant improvements in data/model/generation, you might be able to detect, imperfectly, whether some text was generated by the previous generation of models. But if you’re training a new model, you probably don’t have such a next-gen classifier ready yet. So if you want to do just one training run, it could be easier to limit your training data to text that was available years ago, or to trust only some sources.
  
  A related issue is the use of AI writing assistants that fix grammar and modify human-written text in other ways that the language model considers better. While this seems like a less important problem, such assistants could make human-written text somewhat harder to distinguish from AI-written text, blurring the line from the other direction.
  - tgb 15 Nov 2022 13:03 UTC
    2 points
    I agree; I was just responding to your penultimate sentence: “In fact, if you could know without labeling generated data, why would you generate something that you can tell is bad in the first place?”
    Personally, I think it’s kind of exciting to be part of what might be the last breath of purely human writing. Also, depressing.