I’m generally unclear on what the scope of the empirical discovery is. (I’m also not particularly knowledgeable about machine learning.) Do we have reason to think that it applies in domains outside text completion? Does it apply to models that don’t use transformers? (Is that even a thing now?) Does it apply across all the other bazillion parameters that go into a particular model, like, I dunno, the learning rate, or network width vs depth?
The answer to each of these questions is either “yes” or “tentatively, yes.”
But the evidence doesn’t come from the Chinchilla paper. It comes from the earlier Kaplan et al papers, to which the Chinchilla paper is a response/extension/correction:
Scaling Laws for Neural Language Models (original scaling law paper, includes experiments with width/depth/etc, includes an experiment with a non-transformer model class)
Scaling Laws for Autoregressive Generative Modeling (includes experiments in various non-text and multimodal domains)
If you want to understand this post better, I’d recommend reading those papers, or a summary of them.
This post, and the Chinchilla paper itself, are part of the “conversation” started by the Kaplan papers. They implicitly take some of the results from the Kaplan papers for granted, e.g.
“Scaling Laws for Neural Language Models” found that architectural “shape” differences, like width vs. depth, mattered very little compared to N and D. So, later work tends to ignore these differences.
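To make that concrete, here's a minimal toy sketch (my own illustration, not code from the papers), using Kaplan et al.'s approximation for non-embedding parameter count, N ≈ 12 · n_layer · d_model², and their data-unlimited power law L(N) ≈ (N_c / N)^α_N, with constants that are roughly the fitted values reported there. Two transformers with very different width/depth but the same N get the same predicted loss:

```python
def non_embedding_params(n_layer: int, d_model: int) -> int:
    # Kaplan et al.'s approximation for a standard transformer (d_ff = 4 * d_model):
    # N ~ 12 * n_layer * d_model^2, embeddings excluded.
    return 12 * n_layer * d_model ** 2

def predicted_loss(n: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    # Data-unlimited power law from "Scaling Laws for Neural Language Models":
    # L(N) ~ (N_c / N)^alpha_N. Constants here are approximate fitted values.
    return (n_c / n) ** alpha_n

wide_and_shallow = non_embedding_params(n_layer=12, d_model=2048)  # ~6.0e8 params
narrow_and_deep  = non_embedding_params(n_layer=48, d_model=1024)  # ~6.0e8 params

# Same N, so the fitted law predicts the same loss, regardless of how
# that parameter budget is split between width and depth.
print(predicted_loss(wide_and_shallow), predicted_loss(narrow_and_deep))
```

(In the paper, shape does matter a little at extreme aspect ratios, but the effect is small next to changing N or D, which is exactly the result later work takes for granted.)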
Even if they got some of the details wrong, the Kaplan papers convinced people that LM loss scales in a very regular, predictable manner. It’s empirical work, but it’s the kind of empirical work where your data really does look like it’s closely following some simple curve—not the kind where you fit a simple curve for the sake of interpretation, while understanding that there is a lot of variation it cannot capture.
So, later work tends to be casual about the distinction between “the curve we fit to the data” and “the law governing the real phenomena.” (Theoretical work in this area generally tries to explain why LM loss might follow a simple power law—under the assumption it really does follow such a law—rather than trying to derive some more complicated, real-er functional form.)
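For reference, the kind of “simple curve” at issue here: the Chinchilla paper fits a parametric form along the lines of

$$
L(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}},
$$

where N is parameter count, D is training tokens, E is an irreducible-loss term, and the fitted exponents come out small (roughly α ≈ 0.34 and β ≈ 0.28 in the paper). The Kaplan papers fit closely related pure power laws, e.g. L(N) ≈ (N_c / N)^α_N in the regime where data isn't the bottleneck.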
I would say that the point of a language model is to capture all statistical irregularities in language. [...]
I can imagine a counterargument to this that says: the text data that humanity has generated is being generated from some Platonic distribution that relates to what humans think and talk about, and we want to capture the regularities in that distribution. The existing corpus of text isn’t the population, it is itself a sample, and the LLMs are trying to estimate the regularities from that sample.
Which, sure, that sounds fine, but I think the post sort of just makes it sound like we want to make number go down, and more data makes number go down, without really talking about what it means.
Hmm, I think these days the field views “language modeling” as a means to an end—a way to make something useful, or something smart.
We’re not trying to model language for its own sake. It just so happens that, if you (say) want to make a machine that can do all the stuff ChatGPT can do, training a language model is the right first step.
You might find models like DALLE-2 and Stable Diffusion a helpful reference point. These are generative models—what they do for images is (handwaving some nuances) very close to what LMs do for text. But the people creating and using these things aren’t asking, “is this a good/better model of the natural distribution of text-image pairs?” They care about creating pictures on demand, and about how good the pictures are.
Often, it turns out that if you want a model to do cool and impressive things, the best first step is to make a generative model, and make it as good as you can. People want to “make number go down,” not because we care about the number, but because we’ve seen time and time again that when it goes down, all the stuff we do care about gets better.
This doesn’t fully address your question, because it’s not clear that the observed regularity (“number goes down—stuff gets better”) will continue to hold if we change the distribution we use to train the generative model. As an extreme example, if we added more LM training data that consisted of random numbers or letters, I don’t think anyone would expect that to help.
However, if we add data that’s different but still somehow interesting, it does tend to help—on the new data, obviously, but also to some extent on the old data as well. (There’s another Kaplan scaling paper about that, for instance.)
And at this point, I’d feel wary betting against “more data is better (for doing cool and impressive things later),” as long as the data is interestingly structured and has some relationship to things we care about. (See my exchange with gwern here from a few years ago—I think gwern’s perspective more than mine has been borne out over time.)
Thanks! This whole answer was understandable and clarifying for me.