How limiting are poor corpus quality and limited size for these models? For example, Megatron-Turing NLG was only trained on PubMed extracts but not on PubMed Central, which is part of the Pile dataset. Was this perhaps intentional or an oversight?
Regarding medical texts, I see many shortcomings of the training data. PubMed Central is much smaller at 5 million entries than the whole PubMed corpus at 30 million entries, which seems to be unavailable due to copyright issues. However, perhaps bigger is not better?
Regarding books, how relevant is the limited size and breadth of the corpus? Books3 contains ~200k books, out of the some 10 million that should be available via Amazon.
Regarding copyright, could a less ethical actor gain an advantage by incorporating Sci-Hub and LibGen content, for example? Together, the two claim to include 50 million medical articles and another 2 million books.
Not very, now. Compute & implementation are more limiting.
Corpus quality definitely affects how much bang per FLOP you get, at a moderate (neither extreme nor negligible) sort of constant factor. The Pile is better than what OA used, so an otherwise-identical GPT-3 trained on it would be noticeably better. (But this only goes so far and you will have a hard time doing much better than The Pile in terms of cleaner higher-quality text.)
Corpus size is unimportant because existing corpuses are already enough: the optimal training method is to do 1 epoch, never reusing any data. But you are usually limited by compute, and the compute you have uniquely specifies the model size and the number of training tokens n. Even for Nvidia/MS using their supercomputer with 5k+ A100s, that n is a lot smaller than what The Pile etc. already contains. (See the part about over/undersampling and how few tokens they trained on compared to the full dataset. GPT-3 did the same thing: it oversampled Wikipedia and books, but then didn’t use more than a fraction of its Common Crawl (CC) subset, etc.) So in other words, PMC vs PM is irrelevant: even if you bothered to get the full PM corpus, they already had more text than they could afford to train on. They don’t want to throw out the other data in order to train on mostly PM, so just PMC is fine.
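To make the compute constraint concrete, here is a back-of-the-envelope sketch (my own illustration, not the commenter's math, using assumed round numbers rather than the actual GPT-3/MT-NLG training figures) of how a FLOP budget pins down the affordable token count n:

```python
# Training a dense transformer with N parameters on D tokens costs roughly
# 6 * N * D FLOPs, so a fixed compute budget determines how many tokens you
# can afford to see in that single epoch. Budget and parameter counts below
# are assumed round numbers for illustration only.

def affordable_tokens(compute_flops: float, n_params: float) -> float:
    """Tokens D you can train on once, given C ~= 6 * N * D."""
    return compute_flops / (6 * n_params)

budget = 3e23  # assumed total training compute in FLOPs (GPT-3 reported ~3.14e23)
for n_params in (1.75e11, 5.3e11):  # GPT-3-scale and MT-NLG-scale parameter counts
    d = affordable_tokens(budget, n_params)
    print(f"{n_params:.2e} params -> ~{d / 1e9:.0f}B tokens")

# Whatever the exact numbers, the affordable token count comes out far below
# what The Pile plus a deduplicated Common Crawl already contain, which is the
# point above: you run out of compute long before you run out of raw text.
```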
(When you have too much data, one way to get some value out of it is to filter it more aggressively, cutting it down to just the amount you can afford to train on. But filtering itself is an open research challenge: lots of nasty software engineering, and you may wind up sabotaging your model if you eliminate too much data diversity by mistaking diverse, difficult data for bad data and filtering it out. And even if you do it right, you’re still limited by the first point, that data cleaning only buys you so much in constant factors.)
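Here is a minimal sketch of that kind of budget-driven filtering (my illustration, not any lab's actual pipeline); `quality_score` is a stand-in for whatever signal you trust, such as a classifier trained on a reference corpus (as GPT-3's Common Crawl filtering did) or perplexity under a small LM, and it is exactly where the diversity warning above applies:

```python
from typing import Callable, Iterable, List

def approx_tokens(doc: str) -> int:
    # crude whitespace proxy; a real pipeline would use the model's tokenizer
    return len(doc.split())

def filter_to_budget(docs: Iterable[str],
                     quality_score: Callable[[str], float],
                     token_budget: int) -> List[str]:
    """Keep the highest-scoring documents until the affordable token count is hit."""
    kept: List[str] = []
    used = 0
    for doc in sorted(docs, key=quality_score, reverse=True):
        t = approx_tokens(doc)
        if used + t > token_budget:
            continue  # too big to fit the remaining budget; keep scanning smaller docs
        kept.append(doc)
        used += t
    return kept

if __name__ == "__main__":
    corpus = [
        "a kilogram of feathers weighs the same as a kilogram of toasters",
        "buy cheap pills now " * 20,
        "the mitochondria is the powerhouse of the cell",
    ]
    # toy score: penalize highly repetitive documents
    score = lambda d: len(set(d.split())) / max(1, len(d.split()))
    print(filter_to_budget(corpus, score, token_budget=30))
```

A bad `quality_score` here silently discards exactly the rare, difficult documents the model most needs, which is why the answer calls filtering an open research challenge rather than a solved preprocessing step.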
I should clarify: we aren’t data-limited in the sense of large natural data dumps, but we are data-limited for other kinds of data, the kinds needed to trigger interesting latent capabilities.
In terms of raw data, The Pile and CC have more data than you need for the foreseeable future. This does not apply to other kinds of data, like curated sets of prompts. If you think of the pretraining paradigm, the point of large natural real-world data dumps is not to be large, or because we care about them, or because the updates on 99% of the data will be useful, but that by virtue of their sheer size and indiscriminateness, they happen to contain, hidden throughout like flecks of gold in a giant river of sand, implicit unlabeled ‘hard tasks’ which foster generalization and capabilities through the blessing of scale. One might go through a gigabyte of text before finding an example which truly stresses a model’s understanding of “whether a kilogram of feathers weighs more than a kilogram of toasters”: these are simply weird things to write, they are mostly just implicit, and most examples are easily solved by shortcuts. The more easy examples you solve, the more gigabytes or terabytes you have to process to find a batch of examples you haven’t already solved, and the bigger your model has to be to potentially absorb the remainder. So there are diminishing returns, and you rapidly run out of compute before you run out of raw data.
However, if you can write down a few examples of each of those tasks and produce a highly concentrated dose of those tasks (by distilling the dumps, collating existing challenging benchmarks’ corpuses, recruiting humans to write targeted tasks, using adversarial methods to focus on weak points etc.), you can potentially bring to the surface a lot of learning and meta-learning. This is hard to do because we don’t know what most of those hard tasks are: they are the water in which we swim, and we don’t know what we know or how we know it (which is much of why AI is hard). But you can still try. This has been a very effective approach over the past year or so, and we have yet to see the limits of this approach: the more varied your prompts and tasks, the better models work.
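One low-tech way to picture that kind of curation (my own illustrative sketch, not a description of how any particular lab does it, with purely hypothetical task sources): pool prompts from several hand-curated sources and interleave them so no single task dominates any stretch of the training mix.

```python
import itertools
import random
from typing import Dict, List

def build_task_mixture(task_sources: Dict[str, List[str]], seed: int = 0) -> List[str]:
    """Shuffle each source, then interleave across sources to keep task diversity."""
    rng = random.Random(seed)
    shuffled = [rng.sample(prompts, len(prompts)) for prompts in task_sources.values()]
    mixture: List[str] = []
    for row in itertools.zip_longest(*shuffled):
        mixture.extend(p for p in row if p is not None)
    return mixture

if __name__ == "__main__":
    sources = {
        "physical_reasoning": ["Does a kilogram of feathers weigh more than a kilogram of toasters?"],
        "negation": ["Name a bird that cannot fly.", "Which of these is not prime: 4, 7, 11?"],
        "adversarial_arithmetic": ["What is 17 * 23?", "If I have 3 apples and eat 5, how many are left?"],
    }
    for prompt in build_task_mixture(sources):
        print(prompt)
```

The hard part is not the interleaving but filling in the sources: distilled hard cases from the dumps, existing benchmarks, human-written tasks, and adversarially mined failures, as listed above.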