Modern AI is trained on a huge fraction of the internet
I want to push back against this. The internet (or World Wide Web) is incredibly big. In fact, we don’t know exactly how big it is; measuring its size is itself a research problem!
When people say this, what they mean is that it is trained on a huge fraction of Common Crawl. Common Crawl is a crawl of the web that is free to use. But there are other crawls, and you could crawl the web yourself. Everyone uses Common Crawl because it is decent and because crawling the web is itself a large engineering project.
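To make "free to use" concrete: Common Crawl exposes a public CDX index you can query over HTTP, and individual records can then be pulled from the data bucket with plain range requests. A minimal sketch, assuming the public endpoints at index.commoncrawl.org and data.commoncrawl.org (the crawl label and example URL below are placeholders I picked, not anything from this discussion):

```python
# Sketch: look up a URL in one Common Crawl snapshot and fetch its WARC record.
import gzip
import io
import json

import requests

CRAWL = "CC-MAIN-2024-10"  # example crawl label; check the index page for current ones


def cc_lookup(url: str):
    """Return CDX index records for `url` from one Common Crawl snapshot."""
    resp = requests.get(
        f"https://index.commoncrawl.org/{CRAWL}-index",
        params={"url": url, "output": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    # The index returns newline-delimited JSON records.
    return [json.loads(line) for line in resp.text.splitlines()]


def cc_fetch(record: dict) -> bytes:
    """Fetch one raw record via an HTTP range request and un-gzip it."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    resp = requests.get(
        "https://data.commoncrawl.org/" + record["filename"],
        headers={"Range": f"bytes={start}-{end}"},
        timeout=60,
    )
    resp.raise_for_status()
    # Each record is an independently gzipped member of a WARC file.
    return gzip.GzipFile(fileobj=io.BytesIO(resp.content)).read()


if __name__ == "__main__":
    records = cc_lookup("example.com/")
    if records:
        print(cc_fetch(records[0])[:500])
```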
But Common Crawl is not at all a complete crawl of the web; it is very far from complete. For example, Google has its own proprietary crawl of the web (which you can access as the Google cache). Probabilistic estimates of the size of Google’s search index suggest it is roughly 10x the size of Common Crawl, and Google’s crawl is also not complete. Bing has its own crawl as well.
It is known that Anthropic runs its own crawler, called ClaudeBot. A web crawl is a highly non-trivial engineering project, but it is also reasonably well understood. (Although I have heard that you keep encountering new issues as you push to more and more extreme scales.) There are also failed search engines with their own web crawls, and you could buy them.
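To gesture at why the core is well understood: a crawler is basically a frontier queue plus robots.txt checks plus politeness delays. A toy sketch under those assumptions (names and parameters are mine; everything that makes crawling hard at extreme scale, such as deduplication, crawler traps, and re-crawl scheduling, is left out):

```python
# Toy breadth-first crawler: frontier + robots.txt + politeness delay.
import time
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

import requests


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def allowed(url: str, agent: str = "ToyCrawler") -> bool:
    # A real crawler caches robots.txt per host instead of re-fetching it.
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return False
    return rp.can_fetch(agent, url)


def crawl(seeds, max_pages=50, delay=1.0):
    frontier, seen, pages = deque(seeds), set(seeds), {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if not allowed(url):
            continue
        try:
            resp = requests.get(url, timeout=10, headers={"User-Agent": "ToyCrawler"})
        except requests.RequestException:
            continue
        pages[url] = resp.text
        extractor = LinkExtractor()
        extractor.feed(resp.text)
        for href in extractor.links:
            nxt = urljoin(url, href)
            if nxt.startswith("http") and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
        time.sleep(delay)  # crude per-request politeness
    return pages
```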
There is also another independent web crawl that is public! The Internet Archive has its own crawl; it is just less well known than Common Crawl. Recently someone made use of the Internet Archive crawl and analyzed its overlap with, and differences from, Common Crawl; see https://arxiv.org/abs/2403.14009.
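For reference, the Internet Archive crawl is also queryable over HTTP via the Wayback Machine's CDX server, in much the same spirit as the Common Crawl index above. A minimal sketch (the example URL is a placeholder):

```python
# Sketch: list captures of a URL from the Internet Archive's CDX server.
import requests


def ia_captures(url: str, limit: int = 5):
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={"url": url, "output": "json", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    if not resp.text.strip():
        return []
    rows = resp.json()
    if not rows:
        return []
    header, *records = rows  # first row of the JSON output is the field names
    return [dict(zip(header, rec)) for rec in records]


if __name__ == "__main__":
    for capture in ia_captures("example.com/"):
        print(capture["timestamp"], capture["original"])
```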
If the data wall is a big problem, making use of the Internet Archive crawl is about the first obvious thing to try. But as far as I know, that 2024 paper is the first public work to do so. At the very least, any analysis of the data wall should take into account both Common Crawl and the Internet Archive crawl with the overlap excluded, but I have never seen anyone do this.
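As a back-of-envelope illustration of what "with the overlap excluded" means at the URL level (the linked paper's methodology is more careful than this, and real overlap analysis would also deduplicate by content, not just by URL):

```python
# Sketch: crude URL-level overlap between two crawls' URL lists.
from urllib.parse import urlparse


def normalize(url: str) -> str:
    p = urlparse(url.strip().lower())
    # Crude canonicalization: ignore scheme, query, and fragment; real
    # pipelines use SURT keys or content hashes instead.
    return p.netloc + p.path.rstrip("/")


def overlap_stats(crawl_a_urls, crawl_b_urls):
    a = {normalize(u) for u in crawl_a_urls}
    b = {normalize(u) for u in crawl_b_urls}
    return {
        "only_a": len(a - b),
        "only_b": len(b - a),
        "both": len(a & b),
        "union": len(a | b),
    }


# Example with made-up URL lists standing in for Common Crawl and
# Internet Archive index dumps:
print(overlap_stats(
    ["http://example.com/", "http://example.com/a?utm=1"],
    ["https://example.com/a", "https://example.org/b"],
))
```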
My overall point is that Common Crawl is not the web. It is not complete, and there are other crawls, both public and private. You can also crawl yourself, and we know AI labs do. How much this helps is unclear, but I think 10x more data is very likely, although probably not 100x.
I’m guessing that the sort of data that’s crawled by Google but not Common Crawl is usually low quality? I imagine that if somebody put any effort into writing something, they’ll also put effort into making sure it’s easily accessible, and that the stuff that’s harder to get to is usually machine generated?
Of course, that’s excluding all the data that’s private. I imagine that once you add private messages (e.g. WhatsApp, email) and internal documents, that ends up being far bigger than the publicly available web.
Unclear. I think there is a correlation, but one determinant of a crawl’s completeness, quality, etc. is the choice of seeds. It is known that the Internet Archive crawl has better Chinese data than Common Crawl, because they made a specific effort to improve seeds for the Chinese web. Data that is missing because of seed-selection bias is probably not particularly lower quality than the average of what is in Common Crawl.
(That is, to clarify: yes, in general effort is spent to make quality writing easily accessible, and hence easily crawlable, but accessibility is relative to the choice of seeds, and it is in fact the case that being easily reachable from the Chinese web does not necessarily entail being easily reachable from the English web.)
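A toy way to see the seed point: reachability depends on where you start, so a crawl seeded only from one part of the web can structurally miss perfectly good pages in another. The link graph below is entirely invented:

```python
# Toy illustration: the same link graph crawled (breadth-first reachability)
# from English-web seeds vs. English + Chinese seeds yields different sets.
from collections import deque

LINKS = {
    "en-portal": ["en-blog", "en-news"],
    "en-blog": ["en-news"],
    "en-news": [],
    "zh-portal": ["zh-forum", "zh-blog"],
    "zh-forum": ["zh-blog", "en-news"],  # a few cross-language links exist
    "zh-blog": [],
}


def reachable(seeds):
    seen, frontier = set(seeds), deque(seeds)
    while frontier:
        page = frontier.popleft()
        for nxt in LINKS.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


print(reachable(["en-portal"]))               # misses the zh-* pages entirely
print(reachable(["en-portal", "zh-portal"]))  # adding one seed recovers them
```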
Thanks for this; it is helpful and concrete, and it did change my mind somewhat. Of course, if it really is just 10x, then in terms of orders of magnitude and hyper-fast scaling we are pretty close to the wall.