I’m guessing that the sort of data that’s crawled by Google but not common crawl is usually low quality? I imagine that if somebody put any effort into writing something then they’ll put effort into making sure it’s easily accessible, and the stuff that’s harder to get to is usually machine generated?
Of course that’s excluding all the data that’s private. I imagine that once you add private messages (e.g. WhatsApp, email) + internal documents that ends up being far bigger than the publicly available web.
Unclear. I think there is a correlation, but: one determinant of crawl completeness/quality/etc is choice of seeds. It is known that Internet Archive crawl has better Chinese data than Common Crawl, because they made specific effort to improve seeds for Chinese web. Such missing data originating from choice of seeds bias probably is not particularly low quality than average of what is in Common Crawl.
(That is, to clarify, yes in general effort is spent for quality writing to be easily accessible (hence easily crawlable), but accessibility is relative to choice of seeds, and it in fact is the case that being easily accessible from Chinese web does not necessarily entail being easily accessible from English web.)
I’m guessing that the sort of data that’s crawled by Google but not common crawl is usually low quality? I imagine that if somebody put any effort into writing something then they’ll put effort into making sure it’s easily accessible, and the stuff that’s harder to get to is usually machine generated?
Of course that’s excluding all the data that’s private. I imagine that once you add private messages (e.g. WhatsApp, email) + internal documents that ends up being far bigger than the publicly available web.
Unclear. I think there is a correlation, but: one determinant of crawl completeness/quality/etc is choice of seeds. It is known that Internet Archive crawl has better Chinese data than Common Crawl, because they made specific effort to improve seeds for Chinese web. Such missing data originating from choice of seeds bias probably is not particularly low quality than average of what is in Common Crawl.
(That is, to clarify, yes in general effort is spent for quality writing to be easily accessible (hence easily crawlable), but accessibility is relative to choice of seeds, and it in fact is the case that being easily accessible from Chinese web does not necessarily entail being easily accessible from English web.)