Then scrapers looking for parallel data to train on would find these, and it took a lot of effort to screen them out.
How did Google screen them out? Is there a paper published on this? It seems potentially important.
Er, I’m not sure it’s been published, so I guess I shouldn’t give details. It had to be an automatic solution because human curation couldn’t scale to the size of the problem.
Guessing that it involved a human checking the website and/or sending an email to find out if the author had used a translation tool.
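For illustration only: one plausible automatic heuristic (pure speculation on my part; the unpublished method may look nothing like this) is to check whether a scraped page's translated text is a near-verbatim match for what your own MT system emits, since independent human translations rarely agree that closely. A minimal sketch in Python, where `own_mt_output` stands in for a hypothetical in-house system's translation of the same source text and the threshold is an assumed tuning knob:

```python
import difflib

def looks_machine_translated(own_mt_output: str, page_translation: str,
                             threshold: float = 0.95) -> bool:
    """Flag a scraped translation as likely machine-generated when it is
    a near-verbatim match for what our own MT system produces.
    The 0.95 threshold is an assumption, not a known production value."""
    similarity = difflib.SequenceMatcher(
        None, own_mt_output, page_translation).ratio()
    return similarity >= threshold

# Toy usage: the scraped "translation" matches the MT system's output
# character for character, so it gets flagged and excluded from training.
if __name__ == "__main__":
    mt_output = "The cat sat on the mat."
    scraped = "The cat sat on the mat."
    print(looks_machine_translated(mt_output, scraped))  # True -> screen out
```

The appeal of anything in this family is that it needs no human in the loop, which fits the scaling constraint mentioned above; the obvious weakness is that it mainly catches output from systems similar to your own.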