I disagree. Transfer learning is practically the entire point. ‘Blessings of scale’ etc.
Sure—my point was to contrast two cases:
1. a counterfactual world with a much larger “regular” web, so WebText and Common Crawl are 1000x their real size
2. the real world, where we have to go beyond “regular” web scrapes to add orders of magnitude
Many, including OpenAI, argue that general web crawls are a good way to get high domain diversity for free. This includes domains the researchers would never have come up with themselves.
If we switch to manually hunting down large specialized datasets, this will definitely help, but we’re no longer getting broad domain coverage for free. At best we get broad domain coverage through manual researcher effort and luck; at worst we don’t get it at all.
I see your point about active learning “telling us” when we need more data—that’s especially appealing if it can point us to specific domains where more coverage would help.
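To make that idea concrete, here is a minimal sketch of one way “the model tells us where it needs data” could work: score held-out samples from each candidate domain by the model’s own loss and prioritize whichever domain it currently handles worst. The model choice (“gpt2”), the toy domains, and the loss-ranking heuristic are all illustrative assumptions on my part, not anything proposed above.

```python
# Hedged sketch: rank candidate domains by a language model's average loss on
# held-out text, as a crude stand-in for active learning over data sources.
# The model ("gpt2") and the toy domain samples are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_loss(texts):
    """Mean cross-entropy of the model on a list of strings."""
    losses = []
    for text in texts:
        ids = tok(text, return_tensors="pt", truncation=True, max_length=512)["input_ids"]
        with torch.no_grad():
            out = model(ids, labels=ids)
        losses.append(out.loss.item())
    return sum(losses) / len(losses)

# Tiny illustrative held-out samples per candidate domain.
domains = {
    "github_issues": ["Segfault when calling parse() on an empty config file."],
    "arxiv_abstracts": ["We study convergence rates of stochastic gradient descent."],
    "forum_posts": ["Has anyone tried flashing this firmware on the v2 board?"],
}

# Highest-loss domain = where additional training data would plausibly help most.
ranked = sorted(domains, key=lambda d: avg_loss(domains[d]), reverse=True)
print("Collect more data for:", ranked[0])
```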
I think I see ‘domain-specific datasets’ as broader than you do. You highlight Github, and yet, when I think of Github, I think of thousands of natural and artificial languages, tackling everything related to software in the world (which is increasingly ‘everything’), by millions of people, doing things like uploading banned books to evade the Great Firewall or organizing protests against local officials, filing bugs and discussing things back and forth, often adversarially, all reliant on common sense and world knowledge. A GPT trained on hundreds of gigabytes of Github I would expect to induce meta-learning, reasoning, and everything else, for exactly the same reasons CC/books1/books2/WP do; yes, it would know ‘source code’ well (not a trivial thing in its own right), but that code is a mirror of the real world. I see plenty of broad domain coverage from ‘just’ Github, or ‘just’ Arxiv. (Literotica, I’m less sure about.) I don’t see Github as having much of a disadvantage over CC in terms of broadness or what a model could learn from it. Indeed, given what we know about CC’s general quality and how default preprocessing can screw it up (I see a lot of artifacts in GPT-3’s output that I think are due to bad preprocessing), I expect Github to be more useful than an equivalent amount of CC!
(It’s true Codex does not do this sort of thing beyond what it inherits from GPT-3 pretraining. But that’s because it is aimed solely at programming, and so they deliberately filter out most of Github, trying to detect Python source files and throwing away everything else, not because there isn’t an extremely diverse set of data available on raw Github.)
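For illustration, the kind of filtering being described is roughly the following: keep files that look like Python, drop everything else. The extension check and the line-length heuristic below are my guesses at typical preprocessing of this sort, not OpenAI’s actual pipeline.

```python
# Toy sketch of "keep only what looks like Python source" filtering over a
# checked-out repository. The heuristics are illustrative, not Codex's real ones.
from pathlib import Path

def looks_like_python(path: Path, max_avg_line_len: float = 100.0) -> bool:
    """Crude filter: .py extension plus a sanity check against generated files."""
    if path.suffix != ".py":
        return False
    try:
        text = path.read_text(encoding="utf-8", errors="ignore")
    except OSError:
        return False
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    avg_len = sum(len(line) for line in lines) / len(lines)
    # Very long average lines suggest minified or auto-generated code.
    return avg_len <= max_avg_line_len

def collect_python_files(repo_root: str) -> list[Path]:
    """Walk a repository and keep only files passing the filter."""
    return [p for p in Path(repo_root).rglob("*.py") if looks_like_python(p)]

if __name__ == "__main__":
    for path in collect_python_files("."):
        print(path)
```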
The big advantage of Common Crawl over a Github scrape is that, well, CC already exists. Someone has to invest the effort at some point for every dataset, after all, and for CC you can just go download pre-cleaned versions: aside from EleutherAI’s version (which they expect to be substantially better than default CC on a byte-for-byte basis), Facebook and Google recently released big multilingual CC-derived datasets. But of course, now that EleutherAI has done the Github scrape and added it to the Pile, that’s no longer a problem.
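For a sense of how low the barrier is once a cleaned crawl exists, here is a rough sketch of streaming one of the public cleaned CC derivatives via the Hugging Face datasets library; the specific dataset name and config (“allenai/c4”, “en”) are my assumptions for illustration, not something named above.

```python
# Hedged sketch: stream a pre-cleaned Common Crawl derivative rather than
# processing raw CC yourself. Dataset name/config are illustrative assumptions.
from datasets import load_dataset

# streaming=True avoids downloading the full multi-terabyte dataset up front.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    # Each record is already boilerplate-stripped text plus some metadata.
    print(example["url"])
    print(example["text"][:200].replace("\n", " "), "...")
    if i >= 4:
        break
```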