OK, wow, I didn’t realize the indifference curves were so close to being indifference L-shapes! Now I think Ajeya’s methodology was great after all—my worries have been largely dispelled!
Given that the indifference curves are so close to being L-shaped, it seems there’s a pretty strong argument to be made that since the human brain has 10^15 params or so, it must be doing some fairly important tasks which can’t be done (at least not as well) for much less than 10^15 params. Like, maybe a 10^13 param brain could do the task if it didn’t have to worry about other biological constraints like noisy neurons that occasionally die randomly, or being energy-efficient, etc. But probably these constraints and others like them aren’t that big a deal, such that we can be fairly confident that these tasks require a NN of 10^13 or more params.
The next step in the argument is to say that TAI requires one of these tasks. Then we point out that an AI which is bigger than the human brain should be able to do all the things it can do, in principle. Thus we feel justified in setting the parameter count of our hypothetical TAI to “within a few OOMs of 10^15.”
Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can’t be done for less than 10^15 params is a task which requires 10^15 data points also, because otherwise we could reduce parameter count below 10^15 and keep the same performance.
So I no longer feel weird about this; I feel like this part of Ajeya’s analysis makes sense.
But I am now intensely curious as to how many “data points” the human brain has. Either the argument I just gave above is totally wrong, or the human brain must be trained on 10^15 data points in the course of a human lifetime, or the genome must be substituting for the data points via priors, architectures, etc.
Is the second possibility plausible? I guess so. There are about 10^9 seconds in a human lifetime, so you would need to be processing a million data points a second to reach 10^15… Huh, that seems a bit much.
What about active learning and the like? You talked about how sufficiently big models are extracting all the info out of the data, and that’s why you need more data to do better. But that suggests that curating the data to make it more info-dense should reduce compute requirements, right? Maybe that’s what humans are doing: “only” a billion data points in a lifetime, but really high-quality ones, plus good mechanisms for picking out the right stuff to update on from all the sensory data coming in?
And then there’s the third possibility of course. The third possibility says: These scaling laws only apply to blank-slate, simple neural nets. The brain is not a blank slate, nor is it simple; it has lots of instincts and modules and priors etc. given to it by evolution. So that’s how humans can get away with only 10^9 data points or so. (Well, I guess it should be more like 10^11, right? Each second of experience is more than just one data point, probably more like a hundred, right? What would you say?)
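For what it’s worth, here is the arithmetic behind the second and third possibilities as a rough sketch. The 10^9-second lifetime, the 10^15 target, and the one-versus-a-hundred data points per second are the figures from the comment above; the variable names are only illustrative.

```python
# Back-of-the-envelope figures taken from the comment above (all rough).
LIFETIME_SECONDS = 10**9       # roughly one human lifetime
TARGET_DATA_POINTS = 10**15    # what the L-shaped-curve argument asks for

# Rate needed to reach the target within one lifetime.
required_rate_per_second = TARGET_DATA_POINTS // LIFETIME_SECONDS

# Lifetime totals under the two guesses for data points per second.
total_at_one_per_second = LIFETIME_SECONDS * 1
total_at_hundred_per_second = LIFETIME_SECONDS * 100

# Shortfall, in orders of magnitude, of the "hundred per second" guess.
shortfall_factor = TARGET_DATA_POINTS // total_at_hundred_per_second

print(f"required rate:        {required_rate_per_second:.0e} per second")
print(f"total at 1 per sec:   {total_at_one_per_second:.0e}")
print(f"total at 100 per sec: {total_at_hundred_per_second:.0e}")
print(f"shortfall factor:     {shortfall_factor:.0e}")
```

So even the more generous guess leaves a gap of about four OOMs, which is the gap the priors-from-evolution story would have to cover.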
What do you think of these three possibilities?
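I don’t think this step makes sense:
“Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can’t be done for less than 10^15 params is a task which requires 10^15 data points also.”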
In the picture, it looks like there’s something special about having a 1:1 ratio of data to params. But this is a coincidence due to the authors’ choice of units.
They define “one data point” as “one token,” which is fine. But it seems equally defensible to define “one data point” as “what the model can process in one forward pass,” which is ~1e3 tokens. If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!
To state the point generally, the functional form of the scaling law says nothing about the actual ratio D/N where the indifference curves have their cusps. This depends on your choice of units. And, even if we were careful to use the same units, this ratio could be vastly different for different systems, and people would still say the systems “have the same scaling law.” Scaling is about relationships between differences, not relationships between absolute magnitudes.
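A minimal sketch of the units point, assuming a joint loss of the form L(N, D) = ((N_c/N)^(a_N/a_D) + D_c/D)^a_D. That form and every constant below are illustrative placeholders rather than fitted values from the paper; the only thing the sketch is meant to show is that re-expressing D in forward passes (about 1e3 tokens each) moves the cusp ratio D/N by a factor of 1e3 while leaving the loss, and so the shape of the curves, untouched.

```python
import math

# Placeholder constants, chosen for illustration only (NOT fitted values).
ALPHA_N = 0.08
ALPHA_D = 0.10
N_C = 1e14
D_C_TOKENS = 1e13          # data constant when D is measured in tokens
TOKENS_PER_PASS = 1e3      # "one forward pass" is ~1e3 tokens, per the comment

def loss(n_params, n_data, d_c):
    """Assumed joint scaling form; d_c must use the same units as n_data."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + d_c / n_data) ** ALPHA_D

def cusp_data(n_params, d_c):
    """Data size where the parameter term and the data term are equal."""
    return d_c * (n_params / N_C) ** (ALPHA_N / ALPHA_D)

n = 1e15  # brain-sized parameter count from the thread

# Same physical system, two choices of unit for "one data point".
d_c_passes = D_C_TOKENS / TOKENS_PER_PASS
cusp_tokens = cusp_data(n, D_C_TOKENS)
cusp_passes = cusp_data(n, d_c_passes)

ratio_tokens = cusp_tokens / n
ratio_passes = cusp_passes / n

print(f"cusp D/N in tokens: {ratio_tokens:.3g}")
print(f"cusp D/N in passes: {ratio_passes:.3g}")
print(f"loss at cusp (tokens): {loss(n, cusp_tokens, D_C_TOKENS):.6f}")
print(f"loss at cusp (passes): {loss(n, cusp_passes, d_c_passes):.6f}")
```

The two losses agree, but the “natural-looking” ratio of data to parameters differs by three orders of magnitude depending only on the unit chosen.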
On the larger topic, I’m pessimistic about our ability to figure out how many parameters the brain has, and even more pessimistic about our ability to understand what a reasonable scale for “a data point” is. This is mostly for “Could a Neuroscientist Understand a Microprocessor?”-type reasons. I would be more interested in an argument that starts with upper/lower bounds that feel absurdly extreme but relatively certain, and then tries to understand if (even) these weak bounds imply anything interesting, rather than an argument that aims for a point estimate or a subjective distribution.
Holy shit, mind blown! Then… how are the scaling laws useful at all? I thought the whole point was to tell you how to divide your compute between… Oh, I see. The recommendations for how to divide up your compute would be the same regardless of which definition of data we used. I guess this suggests that it would be most convenient to define data as “how long you run the model during training” (which in turn is maybe “how many times the average parameter of the model is activated during training?”), because that way we can just multiply parameter count by data to get our total compute cost. Or maybe instead we should do what Ajeya does, and define data as the number of updates to the model * the batch size, and then calculate compute by multiplying data * “horizon length.”
I’m very interested to hear your thoughts on Ajeya’s methodology. Is my sketch of it above accurate? Do you agree it’s a good methodology? Does it indeed imply (in conjunction with the scaling laws) that a model with 10^15 params should need 10^15 data points to train to a performance level that you couldn’t have got more easily with a smaller model—regardless of what the horizon length is, or what your training environment is, or what the task is?
...
As for the broader point, what do you think of the Carlsmith report? The figure given in the conclusion seems to give some absurdly extreme but reasonably certain upper and lower bounds. And I think the conclusions we draw from them are already drawn in Ajeya’s report, because she includes uncertainty about this in her model. I suppose you could just redo her model but with even more variance… that would probably make her timelines shorter, funnily enough!
Update: According to this, the human brain actually is getting ~10^7 bits of data every second, although the highest level of conscious awareness is only processing ~50. So insofar as we go with the “tokens” definition, it does seem that the human brain is processing plenty of tokens for its parameter count: 10^16, in fact, over the course of its lifetime. More than enough! And insofar as we go with the “single pass through the network” definition, which would mean we are looking for about 10^12… then we get a small discrepancy; the maximum firing rate of neurons is 250 to 1000 times per second, which over 10^9 seconds means 10^11.5 or so… actually this more or less checks out, I’d say. That’s assuming it’s the max rate that matters and not the average rate (the average rate is about once per second).
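The numbers in the update can be checked with a few lines, as a rough sketch. All inputs are the figures quoted above; counting one bit as one “token” and one neuron firing as one “pass” are the update’s own loose assumptions, not anything established.

```python
import math

# Figures quoted in the update above (all order-of-magnitude estimates).
LIFETIME_SECONDS = 10**9
SENSORY_BITS_PER_SECOND = 10**7
MAX_FIRING_RATES_HZ = (250, 1000)   # quoted range for maximum firing rate
AVERAGE_FIRING_RATE_HZ = 1          # quoted average firing rate

# "Tokens" definition: treat each incoming bit as one data point (assumption).
lifetime_bits = LIFETIME_SECONDS * SENSORY_BITS_PER_SECOND

# "Single pass" definition: treat each firing as one pass (assumption).
passes_low = LIFETIME_SECONDS * MAX_FIRING_RATES_HZ[0]
passes_high = LIFETIME_SECONDS * MAX_FIRING_RATES_HZ[1]
passes_average = LIFETIME_SECONDS * AVERAGE_FIRING_RATE_HZ

print(f"lifetime bits:        10^{math.log10(lifetime_bits):.1f}")
print(f"passes at max rate:   10^{math.log10(passes_low):.1f} "
      f"to 10^{math.log10(passes_high):.1f}")
print(f"passes at avg rate:   10^{math.log10(passes_average):.1f}")
```

If the average rate is the one that matters, the pass count falls to about 10^9, three OOMs short of the 10^12 being looked for.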
Does this mean that it may not actually be true that humans are several OOMs more data-efficient than ANNs? Maybe the apparent data-efficiency advantage is really mostly just the result of transfer learning from vast previous life experience. GPT-3 can likewise “few-shot learn” totally new tasks, and also “fine-tune” on relatively small amounts of data (3+ OOMs less, according to the transfer laws paper!), but really what’s going on there is just transfer learning from its vast pre-training experience.