The scaling laws, IIRC, don’t tell us how much data is needed to reach a useful level of performance.
The scaling laws from the Kaplan et al papers do tell you this.
The relevant law is L(N,D), for the early-stopped test loss given parameter count N and data size D. It has the functional form
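$$L(N,D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$
with $\alpha_N \sim 0.076$, $\alpha_D \sim 0.095$.
The result that you should scale $D \propto N^{0.74}$ comes from trying to keep the two terms in this formula about the same size.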
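To make the balancing heuristic concrete, here is a minimal sketch that sets the two terms of L(N,D) equal and solves for D. The exponents are the ones quoted above; `N_C` and `D_C` are assumed, illustrative constants (they are not given in this thread and should be checked against the paper). Note that with the exponents exactly as quoted, the implied power is αN/αD = 0.8; the 0.74 figure presumably reflects slightly different fitted exponents.

```python
ALPHA_N = 0.076   # exponent quoted in the thread
ALPHA_D = 0.095   # exponent quoted in the thread
N_C = 6.4e13      # ASSUMPTION: illustrative constant, not given in the thread
D_C = 1.8e13      # ASSUMPTION: illustrative constant, not given in the thread


def balance_exponent() -> float:
    """Power p in D ∝ N**p obtained by equating the two terms of L(N, D)."""
    return ALPHA_N / ALPHA_D


def balanced_data(n_params: float) -> float:
    """Return the D at which (N_C/N)**(ALPHA_N/ALPHA_D) equals D_C/D."""
    return D_C * (n_params / N_C) ** balance_exponent()


if __name__ == "__main__":
    for n in (1e9, 1e12, 1e15):
        print(f"N = {n:.0e} -> balanced D = {balanced_data(n):.2e}")
    print(f"implied exponent: {balance_exponent():.2f}")
```

The absolute D values depend entirely on the assumed constants; only the exponent follows from the numbers quoted here.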
This is not exactly a heuristic for managing compute (since D is not dependent on compute, it’s dependent on how much data you can source). It’s more like a heuristic for ensuring that your problem is the right level of difficulty to show off the power of this model size, as compared to smaller models.
You can always train models that are “too large” on datasets that are “too small” according to the heuristic, and they won’t diverge or do poorly or anything. They just won’t improve much upon the results of smaller models.
In terms of the above, you are setting N∼10^15 and then asking what D ought to be. If the heuristic gives you an answer that seems very high, that doesn’t mean the model is “not as data efficient as you expected.” Rather, it means that you need a very large dataset if you want a good reason to push the parameter count up to N∼10^15 rather than using a smaller model to get almost identical performance.
I find it more intuitive to think about the following, both discussed in the papers:
L(D), the N→∞ limit of L(N,D)
meaning: the peak data efficiency possible with this model class
L(N), the D→∞ limit of L(N,D)
meaning: the scaling of loss with parameters when not data-constrained but still using early stopping
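Both limits can be read directly off the functional form of L(N,D). A short worked derivation, taking N_c and D_c to be the constants appearing in that formula:

```latex
\begin{aligned}
L(D) &= \lim_{N\to\infty}
  \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}
  = \left(\frac{D_c}{D}\right)^{\alpha_D}, \\
L(N) &= \lim_{D\to\infty}
  \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}
  = \left(\frac{N_c}{N}\right)^{\alpha_N}.
\end{aligned}
```

So each limit is a pure power law in the one resource that remains finite.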
If the Kaplan et al scaling results are relevant for AGI, I expect one of these two limits to provide the relevant constraint, rather than a careful balance between N and D to ensure we are not in either limit.
Ultimately, we expect AGI to require some specific-if-unknown level of performance (ie crossing some loss threshold L_AGI). Ajeya’s approach essentially assumes that we’ll cross this threshold at a particular value of N, and then further assumes that this will happen in a regime where data and compute limitations are around the same order of magnitude.
I’m not sure why that ought to be true: it seems more likely that one side of the problem will become practically difficult to scale in proportion to the other, after a certain point, and we will essentially hug tight to either the L(N) or the L(D) curve until it hits L_AGI.
Huh, thanks, now I’m more confused about the scaling laws than I was before, in a good way! I appreciate the explanation you gave but am still confused. Some questions:
1. In my discussion with Rohin I said:
Since evolution obviously cares about a lot more than that (and indeed doesn’t care about minimizing compute at all, it just cares about minimizing size and training time separately, with no particular ratio between them except that which is set by the fitness landscape) the laws aren’t directly relevant. In other words, for all we know, if the human brain was 3 OOMs smaller and had one OOM more training time it would be qualitatively superior! Or for all we know, if it had 1 OOM more synapses it would need 2 OOMs less training time to be just as capable. Or… etc. Judging by the scaling laws, it seems like the human brain has a lot more synapses than its childhood length would suggest for optimal performance, or else a lot less if you buy the idea that evolutionary history is part of its training data.
Do you agree or disagree? My guess is that you’d disagree, since you say:
If the heuristic gives you an answer that seems very high, that doesn’t mean the model is “not as data efficient as you expected.” Rather, it means that you need a very large dataset if you want a good reason to push the parameter count up to N∼10^15 rather than using a smaller model to get almost identical performance.
which I take to mean that you think the human brain could have had almost identical performance with much fewer synapses, since it has much more N than is appropriate given its D? (But wait, surely you don’t think that… OK, yeah, I’m just very confused here, please help!)
2. You say “This is not exactly a heuristic for managing compute (since D is not dependent on compute, it’s dependent on how much data you can source).” Well, isn’t it both? You can’t have more D than you have compute, in some sense, because D isn’t the amount of training examples you’ve collected, it’s the amount you actually use to train… right? So… isn’t this a heuristic for managing compute? It sure seemed like it was presented that way.
3. Perhaps it would help me if I could visualize it in two dimensions. Let the y-axis be parameter count, N, and the x-axis be data trained on, D. Make it a heat map with color = loss. Bluer = lower loss. It sounds to me like the compute-optimal scaling law Kaplan et al tout is something like a 45 degree line from the origin such that every point on the line has the lowest loss of all the points on an equivalent-compute indifference curve that contains that point. Whereas you are saying there are two other interesting lines, the L(D) line and the L(N) line, and the L(D) line is (say) a 60-degree line from the origin such that for any point on that line, all points straight above it are exactly as blue. And the L(N) line is (say) a 30-degree line from the origin such that for any point on that line, all points straight to the right of it are exactly as blue. This is the picture I currently have in my head, is it correct in your opinion? (And you are saying that probably when we hit AGI we won’t be on the 45-degree line but rather will be constrained by model size or by data and so will be hugging one of the other two lines)
You can’t have more D than you have compute, in some sense, because D isn’t the amount of training examples you’ve collected, it’s the amount you actually use to train… right? So… isn’t this a heuristic for managing compute? It sure seemed like it was presented that way.
This is a subtle and confusing thing about the Kaplan et al papers. (It’s also the subject of my post that I linked earlier, so I recommend you check that out.)
There are two things in the papers that could be called “optimal compute budgeting” laws:
A law that assumes a sufficiently large dataset (ie effectively infinite dataset), and tells you how to manage the tradeoff between steps S and params N.
The law we discussed above, that assumes a finite dataset, and then tells you how to manage its size D vs params N.
I said the D vs N law was “not a heuristic for managing compute” because the S vs N law is more directly about compute, and is what the authors mean when they talk about compute optimal budgeting.
However, the D vs N law does tell you about how to spend compute in an indirect way, for the exact reason you say, that D is related to how long you train. Comparing the two laws yields the “breakdown” or “kink point.”
Do you agree or disagree? … I take [you] to mean that you think the human brain could have had almost identical performance with much fewer synapses, since it has much more N than is appropriate given its D?
Sorry, why do you expect I disagree? I think I agree. But also, I’m not really claiming the scaling laws say or don’t say anything about the brain, I’m just trying to clarify what they say about (specific kinds of) neural nets (on specific kinds of problems). We have to first understand what they predict about neural nets before we can go on to ask whether those predictions generalize to explain some other area.
Perhaps it would help me if I could visualize it in two dimensions
This part is 100% qualitatively accurate, I think. The one exception is that there are two “optimal compute” lines on the plot with different slopes, for the two laws referred to above. But yeah, I’m saying we won’t be on either of those lines, but on the L(N) or the L(D) line.
I’ve read your linked post thrice now, it’s excellent, any remaining confusions are my fault.
I didn’t confidently expect you to disagree, I just guessed you did. The reason is that the statement you DID disagree with, “The scaling laws, IIRC, don’t tell us how much data is needed to reach a useful level of performance,” was, in my mind, closely related to the paragraph about the human brain which you agree with. Since they were closely related in my mind, I thought if you disagreed with one you’d disagree with the other. The statement about brains is the one I care more about, since it relates to my disagreement with Rohin.
I’m glad my 2D visualization is qualitatively correct! Quantitatively, roughly how many degrees do you think there would be between the L(D) and L(N) laws? In my example it was 30, but of course I just made that up.
Actually, I think I spoke too soon about the visualization… I don’t think your image of L(D) and L(N) is quite right.
Here is what the actual visualization looks like. More blue = lower loss, and I made it a contour plot so it’s easy to see indifference curves of the loss.
In these coordinates, L(D) and L(N) are not really straight lines, but they are close to straight lines when we are far from the diagonal line:
If you look at the upper left region, the indifference curves are parallel to the vertical (N) axis. That is, in this regime, N doesn’t matter and loss is effectively a function of D alone.
This is L(D).
It looks like the color changes you see if you move horizontally through the upper left region.
Likewise, in the lower right region, D doesn’t matter and loss depends on N alone.
This is L(N).
It looks like the color changes you see if you move vertically through the lower right region.
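One way to check the "N doesn't matter here, D doesn't matter there" reading without the plot is to compute how sensitive the loss is to each variable. This is a rough sketch: the exponents are the ones quoted earlier in the thread, while `N_C` and `D_C` are assumed illustrative constants. The function returns the log-log slopes of L(N,D), which follow from differentiating the functional form.

```python
ALPHA_N, ALPHA_D = 0.076, 0.095   # exponents quoted in the thread
N_C, D_C = 6.4e13, 1.8e13         # ASSUMPTION: illustrative constants


def elasticities(n: float, d: float) -> tuple[float, float]:
    """Return (dlnL/dlnN, dlnL/dlnD) for L = [(N_C/N)**(aN/aD) + D_C/D]**aD."""
    a = (N_C / n) ** (ALPHA_N / ALPHA_D)  # parameter-limited term
    b = D_C / d                           # data-limited term
    return -ALPHA_N * a / (a + b), -ALPHA_D * b / (a + b)


if __name__ == "__main__":
    # Upper left of the plot: large N, small D.
    print("upper left :", elasticities(1e15, 1e9))
    # Lower right of the plot: small N, large D.
    print("lower right:", elasticities(1e9, 1e15))
```

In the upper-left case the N-slope is essentially zero and the D-slope approaches -αD; in the lower-right case the roles swap, with the N-slope approaching -αN.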
To restate my earlier claims…
If either N or D is orders of magnitude larger than the other, then you get close to the same loss you would get from N ~ D ~ (whichever OOM is lower). So, setting eg (N, D) = (1e15, 1e12) would be sort of a waste of N, achieving only slightly lower loss than (N, D) = (1e12, 1e12).
This is what motivates the heuristic that you scale D with N, to stay on the diagonal line.
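As a quick numerical illustration of the "waste of N" point, the sketch below evaluates L(N,D) at the three corners mentioned. The exponents are those quoted earlier; `N_C` and `D_C` are assumed illustrative constants, so the absolute loss values should not be taken literally. Only the ordering and relative gaps are the point.

```python
ALPHA_N, ALPHA_D = 0.076, 0.095   # exponents quoted in the thread
N_C, D_C = 6.4e13, 1.8e13         # ASSUMPTION: illustrative constants


def loss(n: float, d: float) -> float:
    """Early-stopped test loss L(N, D) in the functional form quoted above."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D


if __name__ == "__main__":
    small = loss(1e12, 1e12)      # N and D both small
    big_n = loss(1e15, 1e12)      # N scaled up alone
    both = loss(1e15, 1e15)       # N and D scaled together
    print(f"L(1e12, 1e12) = {small:.3f}")
    print(f"L(1e15, 1e12) = {big_n:.3f}")
    print(f"L(1e15, 1e15) = {both:.3f}")
    print(f"gain from N alone : {small - big_n:.3f}")
    print(f"gain from adding D: {big_n - both:.3f}")
```

Under these assumptions, scaling N alone buys noticeably less than subsequently scaling D to match.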
On the other hand, if your goal is to reach some target loss and you have resource constraints, what matters is whichever resource constraint is more restrictive. For example, if we were never able to scale D above 1e12, then we would be stuck achieving a loss similar to GPT-3, never reaching the darkest colors on the graph.
When I said that it’s intuitive to think about L(D) and L(N), I mean that I care about which target losses we can reach. And that’s going to be set, more or less, by the highest N or the highest D we can reach, whichever is more restrictive.
Asking “what could we do with a N=1e15 model?” (or any other number) is kind of a weird question from the perspective of this plot. It could mean either of two very different situations: either we are in the top right corner with N and D scaled together, hitting the bluest region … or we are just near the top somewhere, in which case our loss is entirely determined by D and can be arbitrarily low.
In Ajeya’s work, this question means “let’s assume we’re using an N=1e15 model, and then let’s assume we actually need that many parameters, which must mean we want to reach the target losses in the upper right corner, and then let’s figure out how big D has to be to get there.”
So, the a priori choice of N=1e15 is driving the definition of sufficient performance, defined here as “the performance which you could only reach with N=1e15 params”.
What feels weird to me (which you touched on above) is the way this lets the scaling relations “backseat drive” the definition of sufficient quality for AGI. Instead of saying we want to achieve some specific thing, then deducing we would need N=1e15 params to do it… we start with an unspecified goal and the postulate that we need N=1e15 params to reach it, and then derive the goal from there.
OK, wow, I didn’t realize the indifference curves were so close to being indifference L-shapes! Now I think Ajeya’s methodology was great after all—my worries have been largely dispelled!
Given that the indifference curves are so close to being L-shaped, it seems there’s a pretty strong argument to be made that since the human brain has 10^15 params or so, it must be doing some fairly important tasks which can’t be done (at least not as well) for much less than 10^15 params. Like, maybe a 10^13 param brain could do the task if it didn’t have to worry about other biological constraints like noisy neurons that occasionally die randomly, or being energy-efficient, etc. But probably these constraints and others like them aren’t that big a deal, such that we can be fairly confident that these tasks require a NN of 10^13 or more params.
The next step in the argument is to say that TAI requires one of these tasks. Then we point out that an AI which is bigger than the human brain should be able to do all the things it can do, in principle. Thus we feel justified in setting the parameter count of our hypothetical TAI to “within a few OOMs of 10^15.”
Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can’t be done for less than 10^15 params is a task which requires 10^15 data points also. Because otherwise we could reduce parameter count below 10^15 and keep the same performance.
So I no longer feel weird about this; I feel like this part of Ajeya’s analysis makes sense.
But I am now intensely curious as to how many “data points” the human brain has. Either the argument I just gave above is totally wrong, or the human brain must be trained on 10^15 data points in the course of a human lifetime, or the genome must be substituting for the data points via priors, architectures, etc.
Is the second possibility plausible? I guess so. There are 10^9 seconds in a human lifetime, so if you are processing a million data points a second… Huh, that seems a bit much.
What about active learning and the like? You talked about how sufficiently big models are extracting all the info out of the data, and so that’s why you need more data to do better—but that suggests that curating the data to make it more info-dense should reduce compute requirements, right? Maybe that’s what humans are doing—“only” a billion data points in a lifetime, but really high-quality ones and good mechanisms for focusing on the right stuff to update on of all your sensory data coming in?
And then there’s the third possibility of course. The third possibility says: These scaling laws only apply to blank-slate, simple neural nets. The brain is not a blank slate, nor is it simple; it has lots of instincts and modules and priors etc. given to it by evolution. So that’s how humans can get away with only 10^9 data points or so. (well, I guess it should be more like 10^11, right? Each second of experience is more than just one data point, probably more like a hundred, right? What would you say?)
Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can’t be done for less than 10^15 params is a task which requires 10^15 data points also.
In the picture, it looks like there’s something special about having a 1:1 ratio of data to params. But this is a coincidence due to the authors’ choice of units.
They define “one data point” as “one token,” which is fine. But it seems equally defensible to define “one data point” as “what the model can process in one forward pass,” which is ~1e3 tokens. If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!
To state the point generally, the functional form of the scaling law says nothing about the actual ratio D/N where the indifference curves have their cusps. This depends on your choice of units. And, even if we were careful to use the same units, this ratio could be vastly different for different systems, and people would still say the systems “have the same scaling law.” Scaling is about relationships between differences, not relationships between absolute magnitudes.
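The unit-dependence can be sketched in a few lines. Here the exponents are the ones quoted earlier in the thread, `N_C` and `D_C_TOKENS` are assumed illustrative constants, and `TOKENS_PER_PASS` is the ~1e3 figure mentioned above. Re-expressing D in forward passes rescales D and D_c by the same factor, so the loss is unchanged while the D/N ratio at the cusp shifts by that factor.

```python
import math

ALPHA_N, ALPHA_D = 0.076, 0.095   # exponents quoted in the thread
N_C = 6.4e13                      # ASSUMPTION: illustrative constant
D_C_TOKENS = 1.8e13               # ASSUMPTION: illustrative constant, in tokens
TOKENS_PER_PASS = 1e3             # "~1e3 tokens" per forward pass, from the text


def loss(n: float, d: float, d_c: float) -> float:
    """L(N, D) with the data constant d_c expressed in the same units as d."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + d_c / d) ** ALPHA_D


def cusp_ratio(n: float, d_c: float) -> float:
    """D/N at the point where the two terms of L(N, D) are equal."""
    balanced_d = d_c * (n / N_C) ** (ALPHA_N / ALPHA_D)
    return balanced_d / n


if __name__ == "__main__":
    n, d_tokens = 1e15, 1e15
    d_c_passes = D_C_TOKENS / TOKENS_PER_PASS
    same = math.isclose(
        loss(n, d_tokens, D_C_TOKENS),
        loss(n, d_tokens / TOKENS_PER_PASS, d_c_passes),
    )
    print("loss unchanged by unit choice:", same)
    print("cusp D/N in tokens:", cusp_ratio(n, D_C_TOKENS))
    print("cusp D/N in passes:", cusp_ratio(n, d_c_passes))
```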
On the larger topic, I’m pessimistic about our ability to figure out how many parameters the brain has, and even more pessimistic about our ability to understand what a reasonable scale for “a data point” is. This is mostly for “Could a Neuroscientist Understand a Microprocessor?”-type reasons. I would be more interested in an argument that starts with upper/lower bounds that feel absurdly extreme but relatively certain, and then tries to understand if (even) these weak bounds imply anything interesting, rather than an argument that aims for a point estimate or a subjective distribution.
They define “one data point” as “one token,” which is fine. But it seems equally defensible to define “one data point” as “what the model can process in one forward pass,” which is ~1e3 tokens. If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!
Holy shit, mind blown! Then… how are the scaling laws useful at all then? I thought the whole point was to tell you how to divide your compute between… Oh, I see. The recommendations for how to divide up your compute would be the same regardless of which definition of data we used. I guess this suggests that it would be most convenient to define data as “how long you run the model during training” (which in turn is maybe “how many times the average parameter of the model is activated during training?”) Because that way we can just multiply parameter count by data to get our total compute cost. Or maybe instead we should do what Ajeya does, and define data as the number of updates to the model * the batch size, and then calculate compute by multiplying data * “horizon length.”
I’m very interested to hear your thoughts on Ajeya’s methodology. Is my sketch of it above accurate? Do you agree it’s a good methodology? Does it indeed imply (in conjunction with the scaling laws) that a model with 10^15 params should need 10^15 data points to train to a performance level that you couldn’t have got more easily with a smaller model—regardless of what the horizon length is, or what your training environment is, or what the task is?
...
As for the broader point, what do you think of the Carlsmith report? The figure given in the conclusion seems to give some absurdly extreme but reasonably certain upper and lower bounds. And I think the conclusions we draw from them are already drawn in Ajeya’s report, because she includes uncertainty about this in her model. I suppose you could just redo her model but with even more variance… that would probably make her timelines shorter, funnily enough!
Update: According to this the human brain actually is getting ~10^7 bits of data every second, although the highest level conscious awareness is only processing ~50. So insofar as we go with the “tokens” definition, it does seem that the human brain is processing plenty of tokens for its parameter count: 10^16, in fact, over the course of its lifetime. More than enough! And insofar as we go with the “single pass through the network” definition, which would mean we are looking for about 10^12… then we get a small discrepancy; the maximum firing rate of neurons is 250 to 1000 times per second, which means 10^11.5 or so… actually this more or less checks out I’d say. Assuming it’s the max rate that matters and not the average rate (the average rate is about once per second).
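The arithmetic in this update can be checked with a few lines. This is only a sketch of the back-of-envelope numbers in the comment; treating one bit as one "token" and one maximal-rate firing as one "pass" are the comment's own framing, not established equivalences.

```python
import math

SECONDS_PER_LIFETIME = 10**9       # figure used earlier in the thread
BITS_PER_SECOND = 10**7            # sensory input rate cited in the update
MAX_FIRING_RATES_HZ = (250, 1000)  # range of maximum firing rates cited

# "Tokens" definition: total bits received over a lifetime.
lifetime_bits = SECONDS_PER_LIFETIME * BITS_PER_SECOND

# "Single pass" definition: one pass per firing at the maximum rate.
lifetime_passes_log10 = [
    math.log10(rate * SECONDS_PER_LIFETIME) for rate in MAX_FIRING_RATES_HZ
]

if __name__ == "__main__":
    print(f"lifetime bits: 10^{math.log10(lifetime_bits):.0f}")
    low, high = lifetime_passes_log10
    print(f"lifetime passes: 10^{low:.1f} to 10^{high:.1f}")
```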
Does this mean that it may not actually be true that humans are several OOMs more data-efficient than ANNs? Maybe the apparent data-efficiency advantage is really mostly just the result of transfer learning from vast previous life experience, just as GPT-3 can “few-shot learn” totally new tasks, and also “fine-tune” on relatively small amounts of data (3+ OOMs less, according to the transfer laws paper!) but really what’s going on is just transfer learning from its vast pre-training experience.
The scaling laws from the Kaplan et al papers do tell you this.
The relevant law is L(N,D), for the early-stopped test loss given parameter count N and data size D. It has the functional form
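$$L(N,D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$
with $\alpha_N \sim 0.076$, $\alpha_D \sim 0.095$.
The result that you should scale $D \propto N^{0.74}$ comes from trying to keep the two terms in this formula about the same size.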
This is not exactly a heuristic for managing compute (since D is not dependent on compute, it’s dependent on how much data you can source). It’s more like a heuristic for ensuring that your problem is the right level of difficulty to show off the power of this model size, as compared to smaller models.
You can always train models that are “too large” on datasets that are “too small” according to the heuristic, and they won’t diverge or do poorly or anything. They just won’t improve much upon the results of smaller models.
In terms of the above, you are setting N∼10^15 and then asking what D ought to be. If the heuristic gives you an answer that seems very high, that doesn’t mean the model is “not as data efficient as you expected.” Rather, it means that you need a very large dataset if you want a good reason to push the parameter count up to N∼10^15 rather than using a smaller model to get almost identical performance.
I find it more intuitive to think about the following, both discussed in the papers:
L(D), the N→∞ limit of L(N,D)
meaning: the peak data efficiency possible with this model class
L(N), the D→∞ limit of L(N,D)
meaning: the scaling of loss with parameters when not data-constrained but still using early stopping
If the Kaplan et al scaling results are relevant for AGI, I expect one of these two limits to provide the relevant constraint, rather than a careful balance between N and D to ensure we are not in either limit.
Ultimately, we expect AGI to require some specific-if-unknown level of performance (ie crossing some loss threshold L_AGI). Ajeya’s approach essentially assumes that we’ll cross this threshold at a particular value of N, and then further assumes that this will happen in a regime where data and compute limitations are around the same order of magnitude.
I’m not sure why that ought to be true: it seems more likely that one side of the problem will become practically difficult to scale in proportion to the other, after a certain point, and we will essentially hug tight to either the L(N) or the L(D) curve until it hits L_AGI.
See also my post here.
Huh, thanks, now I’m more confused about the scaling laws than I was before, in a good way! I appreciate the explanation you gave but am still confused. Some questions:
1. In my discussion with Rohin I said:
Do you agree or disagree? My guess is that you’d disagree, since you say:
which I take to mean that you think the human brain could have had almost identical performance with much fewer synapses, since it has much more N than is appropriate given its D? (But wait, surely you don’t think that… OK, yeah, I’m just very confused here, please help!)
2. You say “This is not exactly a heuristic for managing compute (since D is not dependent on compute, it’s dependent on how much data you can source).” Well, isn’t it both? You can’t have more D than you have compute, in some sense, because D isn’t the amount of training examples you’ve collected, it’s the amount you actually use to train… right? So… isn’t this a heuristic for managing compute? It sure seemed like it was presented that way.
3. Perhaps it would help me if I could visualize it in two dimensions. Let the y-axis be parameter count, N, and the x-axis be data trained on, D. Make it a heat map with color = loss. Bluer = lower loss. It sounds to me like the compute-optimal scaling law Kaplan et al tout is something like a 45 degree line from the origin such that every point on the line has the lowest loss of all the points on an equivalent-compute indifference curve that contains that point. Whereas you are saying there are two other interesting lines, the L(D) line and the L(N) line, and the L(D) line is (say) a 60-degree line from the origin such that for any point on that line, all points straight above it are exactly as blue. And the L(N) line is (say) a 30-degree line from the origin such that for any point on that line, all points straight to the right of it are exactly as blue. This is the picture I currently have in my head, is it correct in your opinion? (And you are saying that probably when we hit AGI we won’t be on the 45-degree line but rather will be constrained by model size or by data and so will be hugging one of the other two lines)
This is a subtle and confusing thing about the Kaplan et al papers. (It’s also the subject of my post that I linked earlier, so I recommend you check that out.)
There are two things in the papers that could be called “optimal compute budgeting” laws:
A law that assumes a sufficiently large dataset (ie effectively infinite dataset), and tells you how to manage the tradeoff between steps S and params N.
The law we discussed above, that assumes a finite dataset, and then tells you how to manage its size D vs params N.
I said the D vs N law was “not a heuristic for managing compute” because the S vs N law is more directly about compute, and is what the authors mean when they talk about compute optimal budgeting.
However, the D vs N law does tell you about how to spend compute in an indirect way, for the exact reason you say, that D is related to how long you train. Comparing the two laws yields the “breakdown” or “kink point.”
Sorry, why do you expect I disagree? I think I agree. But also, I’m not really claiming the scaling laws say or don’t say anything about the brain, I’m just trying to clarify what they say about (specific kinds of) neural nets (on specific kinds of problems). We have to first understand what they predict about neural nets before we can go on to ask whether those predictions generalize to explain some other area.
This part is 100% qualitatively accurate, I think. The one exception is that there are two “optimal compute” lines on the plot with different slopes, for the two laws referred to above. But yeah, I’m saying we won’t be on either of those lines, but on the L(N) or the L(D) line.
I’ve read your linked post thrice now, it’s excellent, any remaining confusions are my fault.
I didn’t confidently expect you to disagree, I just guessed you did. The reason is that the statement you DID disagree with, “The scaling laws, IIRC, don’t tell us how much data is needed to reach a useful level of performance,” was, in my mind, closely related to the paragraph about the human brain which you agree with. Since they were closely related in my mind, I thought if you disagreed with one you’d disagree with the other. The statement about brains is the one I care more about, since it relates to my disagreement with Rohin.
I’m glad my 2D visualization is qualitatively correct! Quantitatively, roughly how many degrees do you think there would be between the L(D) and L(N) laws? In my example it was 30, but of course I just made that up.
Actually, I think I spoke too soon about the visualization… I don’t think your image of L(D) and L(N) is quite right.
Here is what the actual visualization looks like. More blue = lower loss, and I made it a contour plot so it’s easy to see indifference curves of the loss.
https://64.media.tumblr.com/8b1897853a66bccafa72043b2717a198/de8ee87db2e582fd-63/s540x810/8b960b152359e9379916ff878c80f130034d1cbb.png
In these coordinates, L(D) and L(N) are not really straight lines, but they are close to straight lines when we are far from the diagonal line:
If you look at the upper left region, the indifference curves are parallel to the vertical (N) axis. That is, in this regime, N doesn’t matter and loss is effectively a function of D alone.
This is L(D).
It looks like the color changes you see if you move horizontally through the upper left region.
Likewise, in the lower right region, D doesn’t matter and loss depends on N alone.
This is L(N).
It looks like the color changes you see if you move vertically through the lower right region.
To restate my earlier claims…
If either N or D is orders of magnitude larger than the other, then you get close to the same loss you would get from N ~ D ~ (whichever OOM is lower). So, setting eg (N, D) = (1e15, 1e12) would be sort of a waste of N, achieving only slightly lower loss than (N, D) = (1e12, 1e12).
This is what motivates the heuristic that you scale D with N, to stay on the diagonal line.
On the other hand, if your goal is to reach some target loss and you have resource constraints, what matters is whichever resource constraint is more restrictive. For example, if we were never able to scale D above 1e12, then we would be stuck achieving a loss similar to GPT-3, never reaching the darkest colors on the graph.
When I said that it’s intuitive to think about L(D) and L(N), I mean that I care about which target losses we can reach. And that’s going to be set, more or less, by the highest N or the highest D we can reach, whichever is more restrictive.
Asking “what could we do with a N=1e15 model?” (or any other number) is kind of a weird question from the perspective of this plot. It could mean either of two very different situations: either we are in the top right corner with N and D scaled together, hitting the bluest region … or we are just near the top somewhere, in which case our loss is entirely determined by D and can be arbitrarily low.
In Ajeya’s work, this question means “let’s assume we’re using an N=1e15 model, and then let’s assume we actually need that many parameters, which must mean we want to reach the target losses in the upper right corner, and then let’s figure out how big D has to be to get there.”
So, the a priori choice of N=1e15 is driving the definition of sufficient performance, defined here as “the performance which you could only reach with N=1e15 params”.
What feels weird to me (which you touched on above) is the way this lets the scaling relations “backseat drive” the definition of sufficient quality for AGI. Instead of saying we want to achieve some specific thing, then deducing we would need N=1e15 params to do it… we start with an unspecified goal and the postulate that we need N=1e15 params to reach it, and then derive the goal from there.
OK, wow, I didn’t realize the indifference curves were so close to being indifference L-shapes! Now I think Ajeya’s methodology was great after all—my worries have been largely dispelled!
Given that the indifference curves are so close to being L-shaped, it seems there’s a pretty strong argument to be made that since the human brain has 10^15 params or so, it must be doing some fairly important tasks which can’t be done (at least not as well) for much less than 10^15 params. Like, maybe a 10^13 param brain could do the task if it didn’t have to worry about other biological constraints like noisy neurons that occasionally die randomly, or being energy-efficient, etc. But probably these constraints and others like them aren’t that big a deal, such that we can be fairly confident that these tasks require a NN of 10^13 or more params.
The next step in the argument is to say that TAI requires one of these tasks. Then we point out that an AI which is bigger than the human brain should be able to do all the things it can do, in principle. Thus we feel justified in setting the parameter count of our hypothetical TAI to “within a few OOMs of 10^15.”
Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can’t be done for less than 10^15 params is a task which requires 10^15 data points also. Because otherwise we could reduce parameter count below 10^15 and keep the same performance.
So I no longer feel weird about this; I feel like this part of Ajeya’s analysis makes sense.
But I am now intensely curious as to how many “data points” the human brain has. Either the argument I just gave above is totally wrong, or the human brain must be trained on 1e15 data points in the course of a human lifetime, or the genome must be substituting for the data points via priors, architectures, etc.
Is the second possibility plausible? I guess so. There are 10^9 seconds in a human lifetime, so if you are processing a million data points a second… Huh, that seems a bit much.
What about active learning and the like? You talked about how sufficiently big models are extracting all the info out of the data, and so that’s why you need more data to do better. But that suggests that curating the data to make it more info-dense should reduce compute requirements, right? Maybe that’s what humans are doing: “only” a billion data points in a lifetime, but really high-quality ones, plus good mechanisms for picking out, from all the sensory data coming in, the right stuff to update on?
And then there’s the third possibility of course. The third possibility says: These scaling laws only apply to blank-slate, simple neural nets. The brain is not a blank slate, nor is it simple; it has lots of instincts and modules and priors etc. given to it by evolution. So that’s how humans can get away with only 10^9 data points or so. (Well, I guess it should be more like 10^11, right? Each second of experience is probably more than just one data point, maybe more like a hundred. What would you say?)
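A quick check of the arithmetic behind these possibilities, as a rough sketch. All figures are the order-of-magnitude guesses from this comment (10^9 seconds, 1e15 data points, about a hundred data points per second); the variable names are just for illustration.

```python
LIFETIME_SECONDS = 1e9          # figure used in this comment
SECONDS_PER_YEAR = 365.25 * 24 * 3600
TARGET_DATA_POINTS = 1e15       # one data point per parameter, per the argument above
GUESSED_RATE = 100              # data points per second, the rough guess above

# How long is 1e9 seconds, really?
lifetime_years = LIFETIME_SECONDS / SECONDS_PER_YEAR

# Rate needed to see 1e15 data points in that time.
required_rate = TARGET_DATA_POINTS / LIFETIME_SECONDS

# What the guessed rate actually adds up to, and how far short it falls.
lifetime_total = GUESSED_RATE * LIFETIME_SECONDS
shortfall = TARGET_DATA_POINTS / lifetime_total

print(f"1e9 seconds is about {lifetime_years:.1f} years")
print(f"required rate: {required_rate:.0e} data points per second")
print(f"at {GUESSED_RATE}/s: {lifetime_total:.0e} data points, short by a factor of {shortfall:.0e}")
```

So on these guesses the second possibility needs about a million data points per second, and the hundred-per-second figure leaves a gap of about four OOMs for priors or data quality to cover.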
What do you think of these three possibilities?
I don’t think this step makes sense:
In the picture, it looks like there’s something special about having a 1:1 ratio of data to params. But this is a coincidence due to the authors’ choice of units.
They define “one data point” as “one token,” which is fine. But it seems equally defensible to define “one data point” as “what the model can process in one forward pass,” which is ~1e3 tokens. If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!
To state the point generally, the functional form of the scaling law says nothing about the actual ratio D/N where the indifference curves have their cusps. This depends on your choice of units. And, even if we were careful to use the same units, this ratio could be vastly different for different systems, and people would still say the systems “have the same scaling law.” Scaling is about relationships between differences, not relationships between absolute magnitudes.
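Here is a small sketch of the units point, assuming the L(N,D) form quoted earlier. The constants N_C and D_C are values I recall from Kaplan et al. and are assumptions here, as is my choice to operationalize “the cusp” as the D at which the two terms in the formula are equal. Switching the data unit from tokens to ~1e3-token forward passes just rescales D_C, so the loss is unchanged while the D/N ratio at the cusp moves by three OOMs.

```python
ALPHA_N = 0.076
ALPHA_D = 0.095
N_C = 8.8e13            # assumed, non-embedding parameters
D_C_TOKENS = 5.4e13     # assumed, measured in tokens
TOKENS_PER_PASS = 1e3   # ~1e3 tokens per forward pass, as in the comment above


def loss(n, d, d_c):
    """L(N, D) with the data constant d_c expressed in the same unit as d."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + d_c / d) ** ALPHA_D


def balanced_d(n, d_c):
    """D at which the N-term and the D-term are equal (my stand-in for the cusp)."""
    return d_c / (N_C / n) ** (ALPHA_N / ALPHA_D)


n = 1e15
d_c_passes = D_C_TOKENS / TOKENS_PER_PASS  # same constant, different unit

d_tokens = balanced_d(n, D_C_TOKENS)
d_passes = balanced_d(n, d_c_passes)

loss_tokens = loss(n, d_tokens, D_C_TOKENS)
loss_passes = loss(n, d_passes, d_c_passes)

print(f"cusp D/N in tokens: {d_tokens / n:.2e}")
print(f"cusp D/N in passes: {d_passes / n:.2e}")
print(f"loss in either unit: {loss_tokens:.4f} vs {loss_passes:.4f}")
```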
On the larger topic, I’m pessimistic about our ability to figure out how many parameters the brain has, and even more pessimistic about our ability to understand what a reasonable scale for “a data point” is. This is mostly for “Could a Neuroscientist Understand a Microprocessor?”-type reasons. I would be more interested in an argument that starts with upper/lower bounds that feel absurdly extreme but relatively certain, and then tries to understand if (even) these weak bounds imply anything interesting, rather than an argument that aims for a point estimate or a subjective distribution.
Holy shit, mind blown! Then… how are the scaling laws useful at all? I thought the whole point was to tell you how to divide your compute between… Oh, I see. The recommendations for how to divide up your compute would be the same regardless of which definition of data we used. I guess this suggests that it would be most convenient to define data as “how long you run the model during training” (which in turn is maybe “how many times the average parameter of the model is activated during training?”), because that way we can just multiply parameter count by data to get our total compute cost. Or maybe instead we should do what Ajeya does, and define data as the number of updates to the model * the batch size, and then calculate compute by multiplying data * “horizon length.”
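A minimal sketch of why the recommendation survives a change of units, under stated assumptions: D grows as N^0.74 as in the heuristic quoted earlier, the prefactor K_TOKENS is a hypothetical placeholder, and training compute is taken to be about 6 FLOPs per parameter per token, which is a common rule of thumb rather than anything stated in this thread.

```python
import math

TOKENS_PER_PASS = 1e3   # ~1e3 tokens per forward pass, as in the comment above
EXPONENT = 0.74         # D grows as N**0.74, per the heuristic quoted earlier
K_TOKENS = 1.0          # hypothetical prefactor; only its unit conversion matters here


def d_in_tokens(n):
    """Recommended data at model size n, measured in tokens."""
    return K_TOKENS * n ** EXPONENT


def d_in_passes(n):
    """The same recommendation, measured in forward passes."""
    return d_in_tokens(n) / TOKENS_PER_PASS


def loglog_slope(f, n_lo, n_hi):
    """Slope of log f(N) against log N between two model sizes."""
    return (math.log(f(n_hi)) - math.log(f(n_lo))) / (math.log(n_hi) - math.log(n_lo))


def train_flops(n, tokens):
    """Assumed rule of thumb: about 6 FLOPs per parameter per token."""
    return 6 * n * tokens


# The scaling exponent is the same in either unit.
slope_tokens = loglog_slope(d_in_tokens, 1e9, 1e12)
slope_passes = loglog_slope(d_in_passes, 1e9, 1e12)

# Compute is the same too, once passes are converted back to tokens.
n = 1e12
flops_from_tokens = train_flops(n, d_in_tokens(n))
flops_from_passes = train_flops(n, d_in_passes(n) * TOKENS_PER_PASS)

print(f"slope in tokens: {slope_tokens:.2f}, slope in passes: {slope_passes:.2f}")
print(f"FLOPs either way: {flops_from_tokens:.3e} vs {flops_from_passes:.3e}")
```

Relabeling the data axis only multiplies D by a constant, so the exponent, and with it the advice on how to grow D with N, is untouched.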
I’m very interested to hear your thoughts on Ajeya’s methodology. Is my sketch of it above accurate? Do you agree it’s a good methodology? Does it indeed imply (in conjunction with the scaling laws) that a model with 10^15 params should need 10^15 data points to train to a performance level that you couldn’t have got more easily with a smaller model—regardless of what the horizon length is, or what your training environment is, or what the task is?
...
As for the broader point, what do you think of the Carlsmith report? The figure given in the conclusion seems to give some absurdly extreme but reasonably certain upper and lower bounds. And I think the conclusions we draw from them are already drawn in Ajeya’s report, because she includes uncertainty about this in her model. I suppose you could just redo her model but with even more variance… that would probably make her timelines shorter, funnily enough!
Update: According to this, the human brain actually is getting ~10^7 bits of data every second, although the highest-level conscious awareness is only processing ~50. So insofar as we go with the “tokens” definition, it does seem that the human brain is processing plenty of tokens for its parameter count: 10^16, in fact, over the course of its lifetime (10^7 per second for ~10^9 seconds). More than enough! And insofar as we go with the “single pass through the network” definition, which would mean we are looking for about 10^12… then we get a small discrepancy; the maximum firing rate of neurons is 250-1000 times per second, which means 10^11.5 or so… actually this more or less checks out, I’d say. Assuming it’s the max rate that matters and not the average rate (the average rate is about once per second).
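The estimates in this update can be reproduced with a few lines, as a rough sketch. Everything here is an order-of-magnitude input taken from the comment itself (10^9 seconds per lifetime, ~10^7 bits per second, firing rates of 250-1000 Hz at most and ~1 Hz on average); treating one bit as one “token” and one firing as one “pass” follows the comment’s framing and is not a claim about the brain.

```python
import math

LIFETIME_SECONDS = 1e9               # figure used earlier in the thread
BITS_PER_SECOND = 1e7                # sensory input rate cited in the update
MAX_FIRING_RATES_HZ = (250, 1000)    # range of maximum firing rates cited
AVERAGE_FIRING_RATE_HZ = 1           # average rate cited

# "Tokens" definition: one bit of input counted as one data point.
log10_tokens = math.log10(BITS_PER_SECOND * LIFETIME_SECONDS)

# "Single pass" definition: one firing counted as one data point.
log10_passes_max = [math.log10(rate * LIFETIME_SECONDS) for rate in MAX_FIRING_RATES_HZ]
log10_passes_avg = math.log10(AVERAGE_FIRING_RATE_HZ * LIFETIME_SECONDS)

print(f"tokens: 10^{log10_tokens:.1f}")
print(f"passes at max rate: 10^{log10_passes_max[0]:.1f} to 10^{log10_passes_max[1]:.1f}")
print(f"passes at average rate: 10^{log10_passes_avg:.1f}")
```

At the maximum rate the pass count lands between about 10^11.4 and 10^12, which is where the “10^11.5 or so” figure comes from; at the average rate it drops to 10^9, three OOMs short of the 10^12 target.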
Does this mean that it may not actually be true that humans are several OOMs more data-efficient than ANNs? Maybe the apparent data-efficiency advantage is really mostly just the result of transfer learning from vast previous life experience, just as GPT-3 can “few-shot learn” totally new tasks, and also “fine-tune” on relatively small amounts of data (3+ OOMs less, according to the transfer laws paper!) but really what’s going on is just transfer learning from its vast pre-training experience.