I have some thoughts that are either confusions, or suggestions for things that should be differently emphasized in this post (which is overall great!).
The first is that, as far as I can tell, these scaling laws are all determined empirically, as in, they literally trained a bunch of models with different parameters and then fit a curve to the points. This is totally fine, that’s how a lot of things are discovered, and the fits look good to me, but a lot of this post reads as though the law is a Law. For example:
At least in terms of loss, Chinchilla doesn’t just beat Gopher. It beats any model trained on Gopher’s data, no matter how big.
This is not literally true, because saying “any model” could include totally different architectures that obey nothing like the empirical curves in this paper.
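(To make “fit a curve to the points” concrete, here is a rough sketch of the kind of thing I mean. This is not the paper’s actual fitting code, and the data below are synthetic, but the functional form is the parametric loss the Chinchilla paper fits.)

```python
# Purely illustrative: fit L(N, D) = E + A/N^alpha + B/D^beta (the parametric
# loss form from the Chinchilla paper) to some synthetic (N, D, loss) points.
import numpy as np
from scipy.optimize import curve_fit

def parametric_loss(ND, E, A, B, alpha, beta):
    N, D = ND  # N = parameter count, D = training tokens
    return E + A / N**alpha + B / D**beta

# Synthetic "training runs": generated from the formula itself plus noise,
# standing in for the measured losses of a real sweep of models.
rng = np.random.default_rng(0)
N = np.logspace(8, 11, 12)      # hypothetical model sizes, 100M to 100B params
D = np.logspace(9.5, 12, 12)    # hypothetical token counts, ~3B to 1T tokens
true_params = (1.7, 400.0, 410.0, 0.34, 0.28)
L = parametric_loss((N, D), *true_params) + rng.normal(0.0, 0.01, N.size)

# The "fit a curve" step: recover (E, A, B, alpha, beta) from the points.
fit, _ = curve_fit(parametric_loss, (N, D), L,
                   p0=(2.0, 300.0, 300.0, 0.3, 0.3), maxfev=50000)
print(dict(zip(["E", "A", "B", "alpha", "beta"], np.round(fit, 3))))
```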
I’m generally unclear on what the scope of the empirical discovery is. (I’m also not particularly knowledgeable about machine learning.) Do we have reason to think that it applies in domains outside text completion? Does it apply to models that don’t use transformers? (Is that even a thing now?) Does it apply across all the other bazillion parameters that go into a particular model, like, I dunno, the learning rate, or network width vs depth?
It also feels like the discussion over “have we used all the data” is skimming over what the purpose of a language model is, or what loss even means. To make an analogy for comparison, consider someone saying “the US census has gathered all possible data on the heights of US citizens. To get a more accurate model, we need to create more US citizens.”
I would say that the point of a language model is to capture all statistical irregularities in language. If we’ve used all the data, then that’s it, we did it. Creating more data will change the actual population that we are trying to run stats on; it will add more patterns that weren’t there before.
I can imagine a counter argument to this that says, the text data that humanity has generated is being generated from some Platonic distribution that relates to what humans think and talk about, and we want to capture the regularities in that distribution. The existing corpus of text isn’t the population, it is itself a sampling, and the LLMs are trying to evaluate the regularities from that sample.
Which, sure, that sounds fine, but I think the post sort of just makes it sound like we want to make number go down, and more data make number go down, without really talking about what it means.
I’m generally unclear on what the scope of the empirical discovery is. (I’m also not particularly knowledgeable about machine learning.) Do we have reason to think that it applies in domains outside text completion? Does it apply to models that don’t use transformers? (Is that even a thing now?) Does it apply across all the other bazillion parameters that go into a particular model, like, I dunno, the learning rate, or network width vs depth?
The answer to each of these questions is either “yes” or “tentatively, yes.”
But the evidence doesn’t come from the Chinchilla paper. It comes from the earlier Kaplan et al. papers, to which the Chinchilla paper is a response/extension/correction:
Scaling Laws for Neural Language Models (the original scaling law paper; includes experiments with width/depth/etc., and an experiment with a non-transformer model class)
Scaling Laws for Autoregressive Generative Modeling (includes experiments in various non-text and multimodal domains)
If you want to understand this post better, I’d recommend reading those papers, or a summary of them.
This post, and the Chinchilla paper itself, are part of the “conversation” started by the Kaplan papers. They implicitly take some of the results from the Kaplan papers for granted, e.g.
“Scaling Laws for Neural Language Models” found that architectural “shape” differences, like width vs. depth, mattered very little compared to N and D. So, later work tends to ignore these differences.
Even if they got some of the details wrong, the Kaplan papers convinced people that LM loss scales in a very regular, predictable manner. It’s empirical work, but it’s the kind of empirical work where your data really does look like it’s closely following some simple curve—not the kind where you fit a simple curve for the sake of interpretation, while understanding that there is a lot of variation it cannot capture.
So, later work tends to be casual about the distinction between “the curve we fit to the data” and “the law governing the real phenomena.” (Theoretical work in this area generally tries to explain why LM loss might follow a simple power law—under the assumption it really does follow such a law—rather than trying to derive some more complicated, real-er functional form.)
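For concreteness, the “simple curve” in question is a power law in the Kaplan setup,

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},$$

where N is parameter count and data isn’t the bottleneck, and a joint parametric form in the Chinchilla paper,

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$

with fitted values of roughly α ≈ 0.34, β ≈ 0.28, E ≈ 1.69 (I’m quoting those from memory, so treat the exact numbers as approximate). This is also the sense in which the “beats any model trained on Gopher’s data” claim holds: take N to infinity in the fitted formula and you’re still left with a floor of E + B/D^β at a fixed data budget D. It’s a claim about the fitted family, not literally about any possible architecture.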
I would say that the point of a language model is to capture all statistical irregularities in language. [...]
I can imagine a counter argument to this that says, the text data that humanity has generated is being generated from some Platonic distribution that relates to what humans think and talk about, and we want to capture the regularities in that distribution. The existing corpus of text isn’t the population, it is itself a sampling, and the LLMs are trying to evaluate the regularities from that sample.
Which, sure, that sounds fine, but I think the post sort of just makes it sound like we want to make number go down, and more data make number go down, without really talking about what it means.
Hmm, I think these days the field views “language modeling” as a means to an end—a way to make something useful, or something smart.
We’re not trying to model language for its own sake. It just so happens that, if you (say) want to make a machine that can do all the stuff ChatGPT can do, training a language model is the right first step.
You might find models like DALLE-2 and Stable Diffusion a helpful reference point. These are generative models—what they do for images is (handwaving some nuances) very close to what LMs do for text. But the people creating and using these things aren’t asking, “is this a good/better model of the natural distribution of text-image pairs?” They care about creating pictures on demand, and about how good the pictures are.
Often, it turns out that if you want a model to do cool and impressive things, the best first step is to make a generative model, and make it as good as you can. People want to “make number go down,” not because we care about the number, but because we’ve seen time and time again that when it goes down, all the stuff we do care about gets better.
This doesn’t fully address your question, because it’s not clear that the observed regularity (“number goes down—stuff gets better”) will continue to hold if we change the distribution we use to train the generative model. As an extreme example, if we added more LM training data that consisted of random numbers or letters, I don’t think anyone would expect that to help.
However, if we add data that’s different but still somehow interesting, it does tend to help—on the new data, obviously, but also to some extent on the old data as well. (There’s another Kaplan scaling paper about that, for instance.)
And at this point, I’d feel wary betting against “more data is better (for doing cool and impressive things later),” as long as the data is interestingly structured and has some relationship to things we care about. (See my exchange with gwern here from a few years ago—I think gwern’s perspective more than mine has been borne out over time.)
Thanks! This whole answer was understandable and clarifying for me.