Then scrapers looking for parallel data to train on would find these, and it took a lot of effort to screen them out.
How did Google screen them out? Is there a paper published on this? It seems potentially important.
Er, I’m not sure it’s been published, so I guess I shouldn’t give details. It had to be an automatic solution because human curation couldn’t scale to the size of the problem.
Guessing that it involved a human checking the website and/or sending an email to find out if the author had used a translation tool.
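For illustration only: one plausible automatic heuristic (pure speculation on my part; the unpublished method may look nothing like this) is to check whether a scraped page's translated text is a near-verbatim match for what your own MT system emits, since independent human translations rarely agree that closely. A minimal sketch in Python, where `own_mt_output` stands in for a hypothetical in-house system's translation of the same source text and the threshold is an assumed tuning knob:

```python
import difflib

def looks_machine_translated(own_mt_output: str, page_translation: str,
                             threshold: float = 0.95) -> bool:
    """Flag a scraped translation as likely machine-generated when it is
    a near-verbatim match for what our own MT system produces.
    The 0.95 threshold is an assumption, not a known production value."""
    similarity = difflib.SequenceMatcher(
        None, own_mt_output, page_translation).ratio()
    return similarity >= threshold

# Toy usage: the scraped "translation" matches the MT system's output
# character for character, so it gets flagged and excluded from training.
if __name__ == "__main__":
    mt_output = "The cat sat on the mat."
    scraped = "The cat sat on the mat."
    print(looks_machine_translated(mt_output, scraped))  # True -> screen out
```

The appeal of anything in this family is that it needs no human in the loop, which fits the scaling constraint mentioned above; the obvious weakness is that it mainly catches output from systems similar to your own.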