I’ve read your linked post thrice now; it’s excellent, and any remaining confusions are my fault.
I didn’t confidently expect you to disagree; I just guessed you did. The reason is that the statement you DID disagree with (“The scaling laws, IIRC, don’t tell us how much data is needed to reach a useful level of performance.”) was, in my mind, closely related to the paragraph about the human brain, which you agree with. Since the two were closely linked in my mind, I figured that if you disagreed with one you’d disagree with the other. The statement about brains is the one I care more about, since it relates to my disagreement with Rohin.
I’m glad my 2D visualization is qualitatively correct! Quantitatively, roughly how many degrees do you think there would be between the L(D) and L(N) laws? In my example it was 30, but of course I just made that up.
Actually, I think I spoke too soon about the visualization… I don’t think your image of L(D) and L(N) is quite right.
Here is what the actual visualization looks like. More blue = lower loss, and I made it a contour plot so it’s easy to see indifference curves of the loss.
https://64.media.tumblr.com/8b1897853a66bccafa72043b2717a198/de8ee87db2e582fd-63/s540x810/8b960b152359e9379916ff878c80f130034d1cbb.png
In these coordinates, L(D) and L(N) are not really straight lines, but they are close to straight lines when we are far from the diagonal line:
If you look at the upper left region, the indifference curves are parallel to the vertical (N) axis. That is, in this regime, N doesn’t matter and loss is effectively a function of D alone.
This is L(D).
It corresponds to the color changes you see as you move horizontally through the upper left region.
Likewise, in the lower right region, D doesn’t matter and loss depends on N alone.
This is L(N).
It corresponds to the color changes you see as you move vertically through the lower right region.
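For concreteness, here is a minimal sketch of how a plot like this could be generated, assuming the parametric form and published constants from Kaplan et al. (2020); this is my reconstruction, not the code behind the image above:

```python
# Sketch of an L(N, D) contour plot, assuming the Kaplan et al. (2020) form
#   L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D
# with their published constants. Illustrative only.
import numpy as np
import matplotlib.pyplot as plt

alpha_N, alpha_D = 0.076, 0.095
N_c, D_c = 8.8e13, 5.4e13  # params, tokens

def loss(N, D):
    """Joint scaling law: loss as a function of parameter count N and training tokens D."""
    return ((N_c / N) ** (alpha_N / alpha_D) + D_c / D) ** alpha_D

# Log-spaced grid; horizontal axis = D (data), vertical axis = N (params).
D = np.logspace(6, 15, 300)
N = np.logspace(6, 15, 300)
DD, NN = np.meshgrid(D, N)

plt.contourf(DD, NN, loss(NN, DD), levels=30, cmap="Blues_r")  # darker blue = lower loss
plt.xscale("log"); plt.yscale("log")
plt.xlabel("D (tokens)"); plt.ylabel("N (params)")
plt.colorbar(label="L(N, D)")
plt.show()
```

In the upper left the D_c/D term dominates (that’s L(D)); in the lower right the (N_c/N) term dominates (that’s L(N)).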
To restate my earlier claims…
If either N or D is orders of magnitude larger than the other, then you get close to the same loss you would get from N ~ D ~ (whichever OOM is lower). So, setting e.g. (N, D) = (1e15, 1e12) would be sort of a waste of N, achieving only slightly lower loss than (N, D) = (1e12, 1e12).
This is what motivates the heuristic that you scale D with N, to stay on the diagonal line.
On the other hand, if your goal is to reach some target loss and you have resource constraints, what matters is whichever resource constraint is more restrictive. For example, if we were never able to scale D above 1e12, then we would be stuck achieving a loss similar to GPT-3, never reaching the darkest colors on the graph.
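A quick numeric check of both claims, using the same assumed form and constants as in the sketch above (illustrative numbers, not values read off the actual plot):

```python
# Evaluate the assumed Kaplan et al. L(N, D) at a few (N, D) combinations.
alpha_N, alpha_D, N_c, D_c = 0.076, 0.095, 8.8e13, 5.4e13
loss = lambda N, D: ((N_c / N) ** (alpha_N / alpha_D) + D_c / D) ** alpha_D

for N, D in [(1e12, 1e12), (1e15, 1e12), (1e15, 1e15)]:
    print(f"N={N:.0e}, D={D:.0e}: L = {loss(N, D):.2f}")
# (1e15, 1e12) comes out only slightly below (1e12, 1e12): the D term dominates there,
# so the extra three OOMs of N are mostly wasted. And with D capped at 1e12, the loss
# stays near that level no matter how large N gets.
```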
When I said that it’s intuitive to think about L(D) and L(N), I meant that I care about which target losses we can reach. And that’s going to be set, more or less, by the highest N or the highest D we can reach, whichever is more restrictive.
Asking “what could we do with an N=1e15 model?” (or any other number) is kind of a weird question from the perspective of this plot. It could mean either of two very different situations: either we are in the top right corner with N and D scaled together, hitting the bluest region … or we are just near the top somewhere, in which case our loss is entirely determined by D and could be almost anything, depending on how large D is.
In Ajeya’s work, this question means “let’s assume we’re using an N=1e15 model, and then let’s assume we actually need that many parameters, which must mean we want to reach the target losses in the upper right corner, and then let’s figure out how big D has to be to get there.”
So, the a priori choice of N=1e15 is driving the definition of sufficient performance, defined here as “the performance which you could only reach with N=1e15 params”.
What feels weird to me—which you touched on above—is the way this lets the scaling relations “backseat drive” the definition of sufficient quality for AGI. Instead of saying we want to achieve some specific thing, then deducing we would need N=1e15 params to do it… we start with an unspecified goal and the postulate that we need N=1e15 params to reach it, and then derive the goal from there.
OK, wow, I didn’t realize the indifference curves were so close to being indifference L-shapes! Now I think Ajeya’s methodology was great after all—my worries have been largely dispelled!
Given that the indifference curves are so close to being L-shaped, it seems there’s a pretty strong argument to be made that since the human brain has 10^15 params or so, it must be doing some fairly important tasks which can’t be done (at least not as well) with much less than 10^15 params. Like, maybe a 10^13-param brain could do the task if it didn’t have to worry about other biological constraints like noisy neurons that occasionally die randomly, or being energy-efficient, etc. But probably these constraints and others like them aren’t that big a deal, such that we can be fairly confident that these tasks require a NN of 10^13 or more params.
The next step in the argument is to say that TAI requires one of these tasks. Then we point out that an AI which is bigger than the human brain should, in principle, be able to do all the things the brain can do. Thus we feel justified in setting the parameter count of our hypothetical TAI to “within a few OOMs of 10^15.”
Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can’t be done for less than 10^15 params is a task which requires 10^15 data points also. Because otherwise we could reduce parameter count below 10^15 and keep the same performance.
So I no longer feel weird about this; I feel like this part of Ajeya’s analysis makes sense.
But I am now intensely curious as to how many “data points” the human brain gets. Either the argument I just gave above is totally wrong, or the human brain must be trained on 10^15 data points in the course of a human lifetime, or the genome must be substituting for the data points via priors, architectures, etc.
Is the second possibility plausible? I guess so. There are 10^9 seconds in a human lifetime, so if you are processing a million data points a second… Huh, that seems a bit much.
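Back-of-the-envelope for that, assuming roughly 10^15 data points have to fit into one lifetime:

```python
# Rate implied by possibility #2: 10^15 data points spread over one human lifetime.
seconds_per_lifetime = 80 * 365 * 24 * 3600       # ~2.5e9 seconds, i.e. order 10^9
data_points_needed = 1e15
print(data_points_needed / seconds_per_lifetime)  # ~4e5 per second -- on the order of a million
```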
What about active learning and the like? You talked about how sufficiently big models are extracting all the info out of the data, and so that’s why you need more data to do better—but that suggests that curating the data to make it more info-dense should reduce compute requirements, right? Maybe that’s what humans are doing—“only” a billion data points in a lifetime, but really high-quality ones, with good mechanisms for focusing on the right parts of all the sensory data coming in?
And then there’s the third possibility of course. The third possibility says: These scaling laws only apply to blank-slate, simple neural nets. The brain is not a blank slate, nor is it simple; it has lots of instincts and modules and priors etc. given to it by evolution. So that’s how humans can get away with only 10^9 data points or so. (well, I guess it should be more like 10^11, right? Each second of experience is more than just one data point, probably more like a hundred, right? What would you say?)
What do you think of these three possibilities?
I don’t think the step from “can’t be done for less than 10^15 params” to “requires 10^15 data points” makes sense:
In the picture, it looks like there’s something special about having a 1:1 ratio of data to params. But this is a coincidence due to the authors’ choice of units.
They define “one data point” as “one token,” which is fine. But it seems equally defensible to define “one data point” as “what the model can process in one forward pass,” which is ~1e3 tokens. If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!
To state the point generally, the functional form of the scaling law says nothing about the actual ratio D/N where the indifference curves have their cusps. This depends on your choice of units. And, even if we were careful to use the same units, this ratio could be vastly different for different systems, and people would still say the systems “have the same scaling law.” Scaling is about relationships between differences, not relationships between absolute magnitudes.
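To spell the units point out, here is the rescaling written explicitly, assuming the Kaplan et al. parametric form (my notation, not anything taken from their figures):

```latex
% Data measured in tokens:
L(N, D) = \left[ \left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D} \right]^{\alpha_D}

% Re-measure data in units of k tokens (e.g. k ~ 10^3 for "one forward pass"), so D' = D/k:
L(N, D) = \left[ \left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c/k}{D'} \right]^{\alpha_D}

% Same exponents, i.e. the "same scaling law"; only the constant D_c (and with it the D/N
% ratio at which the two terms are comparable, the cusp of the indifference curves) shifts
% by the factor k.
```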
On the larger topic, I’m pessimistic about our ability to figure out how many parameters the brain has, and even more pessimistic about our ability to understand what a reasonable scale for “a data point” is. This is mostly for “Could a Neuroscientist Understand a Microprocessor?”-type reasons. I would be more interested in an argument that starts with upper/lower bounds that feel absurdly extreme but relatively certain, and then tries to understand whether (even) these weak bounds imply anything interesting, rather than an argument that aims for a point estimate or a subjective distribution.
Holy shit, mind blown! If the 1:1 ratio is just an artifact of measuring data in tokens, then… how are the scaling laws useful at all? I thought the whole point was to tell you how to divide your compute between… Oh, I see. The recommendations for how to divide up your compute would be the same regardless of which definition of data we used. I guess this suggests that it would be most convenient to define data as “how long you run the model during training” (which in turn is maybe “how many times the average parameter of the model is activated during training”?), because that way we can just multiply parameter count by data to get our total compute cost. Or maybe instead we should do what Ajeya does, and define data as the number of updates to the model * the batch size, and then calculate compute by multiplying data * “horizon length” (and then by parameter count, as before).
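To make the bookkeeping concrete, here is a rough sketch. The C ≈ 6·N·D rule of thumb for transformer training compute is standard; the second accounting is only my guess at the spirit of Ajeya’s setup, with placeholder numbers rather than figures from her report:

```python
# Token accounting: training compute ~ 6 * params * tokens (standard transformer rule of thumb).
N = 1e15          # params
D_tokens = 1e15   # training tokens
print(6 * N * D_tokens)   # ~6e30 FLOP

# Update-based accounting (sketch only): "data" = updates * batch_size, with each data point
# costing on the order of horizon_length passes through the model. Multiplying by params again
# gives compute; horizon_length = 1 roughly recovers the token accounting, up to the choice of
# units for a "data point".
updates, batch_size, horizon_length = 1e7, 1e8, 1.0  # placeholder values
D_points = updates * batch_size
print(6 * N * D_points * horizon_length)
```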
I’m very interested to hear your thoughts on Ajeya’s methodology. Is my sketch of it above accurate? Do you agree it’s a good methodology? Does it indeed imply (in conjunction with the scaling laws) that a model with 10^15 params should need 10^15 data points to train to a performance level that you couldn’t have got more easily with a smaller model—regardless of what the horizon length is, or what your training environment is, or what the task is?
...
As for the broader point, what do you think of the Carlsmith report? The figure given in the conclusion seems to give some absurdly extreme but reasonably certain upper and lower bounds. And I think the conclusions we draw from them are already drawn in Ajeya’s report, because she includes uncertainty about this in her model. I suppose you could just redo her model but with even more variance… that would probably make her timelines shorter, funnily enough!
Update: According to this, the human brain actually is getting ~10^7 bits of data every second, although the highest level of conscious awareness is only processing ~50. So insofar as we go with the “tokens” definition, it does seem that the human brain is processing plenty of tokens for its parameter count -- 10^16, in fact, over the course of its lifetime. More than enough! And insofar as we go with the “single pass through the network” definition, which would mean we are looking for about 10^12… then we get a small discrepancy; the maximum firing rate of neurons is 250–1000 times per second, which means 10^11.5 or so… actually this more or less checks out, I’d say. Assuming it’s the max rate that matters and not the average rate (the average rate is about once per second).
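Checking that arithmetic (orders of magnitude only; the bit rate and firing-rate figures are just the ones cited above):

```python
seconds_per_lifetime = 1e9

bits_per_second = 1e7                  # sensory input rate cited above
print(bits_per_second * seconds_per_lifetime)     # 1e16 -- plenty relative to 10^15 params

max_firing_rate_hz = 300               # within the 250-1000 Hz range cited above
print(max_firing_rate_hz * seconds_per_lifetime)  # ~3e11, i.e. roughly 10^11.5 "forward passes"
```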
Does this mean that it may not actually be true that humans are several OOMs more data-efficient than ANNs? Maybe the apparent data-efficiency advantage is mostly the result of transfer learning from vast previous life experience, just as GPT-3 can “few-shot learn” totally new tasks and “fine-tune” on relatively small amounts of data (3+ OOMs less, according to the transfer laws paper!), when really what’s going on is transfer learning from its vast pre-training experience.