This makes sense to me and is what I’ve been considering as the implication of sample-efficiency (one of the blessings of scale), coming at it from another direction, meta-learning/Bayesian RL: if your model gets more sample-efficient as it gets larger & n gets larger, it’s because it’s increasingly approaching a Bayes-optimal learner and so it gets more out of the same data, but then when you hit the Bayes-limit, how are you going to learn more from each datapoint? You have to switch over to a different and inferior scaling law. You can’t squeeze blood from a stone; once you approach the intrinsic entropy, there’s not much to learn. Steeply diminishing returns are built into compiling large text datasets and just training on random samples. It looks like the former regime is the one we’ve been in up to GPT-3 and beyond, and the latter is when the slower data-only scaling kicks in.
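To make the crossover concrete, here is a back-of-the-envelope sketch assuming the approximate power-law fits reported in Kaplan et al. (2020); treat the exponents and constants as illustrative rather than exact:

```python
# Rough crossover estimate: for a fixed dataset of D tokens, at what compute
# budget does the compute-scaling curve L(C) run into the data-limited floor
# L(D)? Constants are the approximate Kaplan et al. (2020) fits.
ALPHA_D, D_C = 0.095, 5.4e13   # L(D) = (D_C / D) ** ALPHA_D, D in tokens
ALPHA_C, C_C = 0.050, 3.1e8    # L(C) = (C_C / C) ** ALPHA_C, C in PF-days

def data_limited_loss(d_tokens: float) -> float:
    return (D_C / d_tokens) ** ALPHA_D

def crossover_compute(d_tokens: float) -> float:
    """Compute budget (PF-days) at which L(C) reaches the floor set by d_tokens."""
    return C_C / data_limited_loss(d_tokens) ** (1.0 / ALPHA_C)

print(crossover_compute(3e11))  # roughly 1.6e4 PF-days for a ~300B-token dataset
```

On these numbers, the compute curve for a ~300B-token dataset (about what GPT-3 saw) hits its data floor only a few multiples beyond GPT-3’s reported ~3.6e3 PF-days of training compute, which is the sense in which the crossover is close.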
Aside from multimodal approaches, the crossover raises the question of whether it’s time to invest in improvements like active learning. What scaling curve in L(D)/L(C) could we get with even a simple active learning approach like running a small GPT over Common Crawl and throwing out datapoints which are too easily predicted?
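As a minimal sketch of what that filter could look like (assuming the HuggingFace transformers library and a small GPT-2 as the scorer; the threshold is made up and would need tuning):

```python
# Minimal sketch: score Common Crawl documents with a small LM and drop
# the ones it already predicts too easily. Assumes the HuggingFace
# `transformers` library; the threshold below is illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")                  # small "scorer" GPT
scorer = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def nats_per_token(text: str) -> float:
    """Average LM loss of the scorer on one document."""
    ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    with torch.no_grad():
        return scorer(ids, labels=ids).loss.item()

KEEP_THRESHOLD = 2.5  # nats/token; made up, would need tuning

def keep(doc: str) -> bool:
    # Keep only documents the small model still finds hard to predict.
    return nats_per_token(doc) > KEEP_THRESHOLD
```

A real run would batch this and stream CC shards, but the decision rule is just “drop anything the small model already predicts too well.”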
What scaling curve in L(D)/L(C) could we get with even a simple active learning approach like running a small GPT over Common Crawl and throwing out datapoints which are too easily predicted?
IIUC, this is trying to make L(D) fall faster by making every data point more impactful (at lowering test loss). This will help if:
1. you get most of the way to intrinsic entropy L(D) on your first pass over D points, and
2. you can downsample your full dataset without lowering the total number of examples seen in training, i.e. you have too many points to do one full epoch over them.
I can imagine this regime becoming the typical one for non-text modalities like video that have huge data with lots of complex redundancy (which the model will learn to compress).
With text data, though, I’m concerned that (2) will fail soon.
The number of tokens GPT-3 saw in training was the same order of magnitude as the size of (filtered) Common Crawl. I haven’t seen convincing evidence that comparably good/diverse text datasets can be constructed which are 10x this size, 100x, etc. The Pile is an interesting experiment, but they’re mostly adding large quantities of single-domain text like Github, which is great for those domains but won’t help outside them.
The Pile is an interesting experiment, but they’re mostly adding large quantities of single-domain text like Github, which is great for those domains but won’t help outside them.
I disagree. Transfer learning is practically the entire point. ‘Blessings of scale’ etc.
I disagree. Transfer learning is practically the entire point. ‘Blessings of scale’ etc.
Sure—my point was to contrast two cases:
a counterfactual world with a much larger “regular” web, so WebText and Common Crawl are 1000x their real size
the real world, where we have to go beyond “regular” web scrapes to add orders of magnitude
Many, including OpenAI, argue that general web crawls are a good way to get high domain diversity for free. This includes domains the researchers would never have come up with themselves.
If we switch to manually hunting down large specialized datasets, this will definitely help, but we’re no longer getting broad domain coverage for free. At best we get broad domain coverage through manual researcher effort and luck, at worst we don’t get it at all.
I see your point about active learning “telling us” when we need more data—that’s especially appealing if it can point us to specific domains where more coverage would help.
I think I see ‘domain-specific datasets’ as broader than you do. You highlight Github, and yet, when I think of Github, I think of thousands of natural and artificial languages, tackling everything related to software in the world (which is increasingly ‘everything’), by millions of people, doing things like uploading banned books to evade the Great Firewall or organizing protests against local officials, filing bugs and discussing things back and forth, often adversarially, all reliant on common sense and world knowledge. I would expect a GPT trained on hundreds of gigabytes of Github to induce meta-learning, reasoning, and everything else, for exactly the same reasons CC/books1/books2/WP do; yes, it would know ‘source code’ well (not a trivial thing in its own right), but that is a mirror of the real world. I see plenty of broad domain coverage from ‘just’ Github, or ‘just’ Arxiv. (Literotica, I’m less sure about.) I don’t see Github as having much of a disadvantage over CC in terms of broadness or what a model could learn from it. Indeed, given what we know about CC’s general quality and how default preprocessing can screw it up (I see a lot of artifacts in GPT-3’s output that I think are due to bad preprocessing), I expect Github to be more useful than an equivalent amount of CC!
(It’s true Codex does not do this sort of thing beyond what it inherits from GPT-3 pretraining. But that’s because it is aimed solely at programming, and so they deliberately filter out most of Github, trying to detect Python source files and throwing away everything else, not because there isn’t an extremely diverse set of data available on raw Github.)
The big advantage of Common Crawl over a Github scrape is that, well, CC already exists. Someone has to invest the effort at some point for all datasets, after all. You can go download pre-cleaned versions of it—aside from EleutherAI’s version (which they expect to be substantially better than CC on a byte-for-byte basis), Facebook and Google recently released big multilingual CC-derived datasets. But of course, now that EleutherAI has done that work for Github and added it to the Pile, that’s no longer a problem.
if your model gets more sample-efficient as it gets larger & n gets larger, it’s because it’s increasingly approaching a Bayes-optimal learner and so it gets more out of the same data, but then when you hit the Bayes-limit, how are you going to learn more from each datapoint? You have to switch over to a different and inferior scaling law. You can’t squeeze blood from a stone; once you approach the intrinsic entropy, there’s not much to learn.
I found this confusing. It sort of seems like you’re assuming that a Bayes-optimal learner achieves the Bayes error rate (are you?), which seems wrong to me.
What do you mean by “the Bayes-limit”? At first, I assumed you were talking about the Bayes error rate (https://en.wikipedia.org/wiki/Bayes_error_rate), but that is (roughly) the error you could expect to achieve with infinite data, and we’re still talking about finite data.
What do you mean by “Bayes-optimal learner”? I assume you just mean something that applies Bayes’ rule exactly (so it depends on the prior/data).
I’m confused by you talking about “approach[ing] the intrinsic entropy”… it seems like the figure in the OP shows L(C) approaching L(D). But is L(D) supposed to represent intrinsic entropy? Should we trust it as an estimate of intrinsic entropy?
I also don’t see how active learning is supposed to help (unless you’re talking about actively generating data)… I thought the whole point you were trying to make is that once you reach the Bayes error rate there’s literally nothing you can do to keep improving without more data. You talk about using active learning to throw out data-points… but I thought the problem was not having enough data? So how is throwing out data supposed to help with that?
Filtering for difficulty like that is tricky. In particular, the most difficult samples are random noise, or Chinese, or something else the model can’t begin to comprehend.
Some approaches I would consider:
Curriculum learning—Have a bunch of checkpoints from a smaller GPT. Say the big GPT currently has an LM loss of 3. Then show it the examples where the smaller GPT’s loss improved most rapidly when its average loss was 3 (a rough sketch of this selection rule follows after the list).
Quality—Put more effort into filtering out garbage and upsampling high-quality corpora like Wikipedia.
Retrieval—Let the model look things up when it’s confused, like MARGE from Pretraining via Paraphrasing does.
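Here is a rough sketch of that curriculum-learning selection rule, assuming per-example losses have already been logged for two adjacent small-GPT checkpoints whose average loss brackets the big model’s current ~3 nats/token (all names below are hypothetical):

```python
import numpy as np

def select_curriculum_batch(loss_before, loss_after, k):
    """Pick the k examples whose loss under the small GPT improved most
    between the two checkpoints, i.e. the examples it was learning
    fastest around the big model's current loss level."""
    improvement = np.asarray(loss_before) - np.asarray(loss_after)
    return np.argsort(-improvement)[:k]

# Hypothetical usage: losses_ckpt_a / losses_ckpt_b are per-example losses
# logged just before / just after the small GPT's average loss crossed 3.
# idx = select_curriculum_batch(losses_ckpt_a, losses_ckpt_b, k=100_000)
# next_shard = [dataset[i] for i in idx]
```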