They define “one data point” as “one token,” which is fine. But it seems equally defensible to define “one data point” as “what the model can process in one forward pass,” which is ~1e3 tokens. If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!
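To make the change of units concrete, here is a minimal sketch of the conversion. It assumes, as the question further down the thread does, that the token-based definition asks for roughly one data point per parameter (1e15 tokens for 1e15 params); the function and constant names are made up for illustration.

```python
import math

# Rough figure quoted above: tokens consumed in one forward pass.
TOKENS_PER_FORWARD_PASS = 1e3

def tokens_to_forward_passes(tokens: float) -> float:
    """Re-express a token count as a number of forward passes."""
    return tokens / TOKENS_PER_FORWARD_PASS

# Assumption: under the token definition, ~1 data point per parameter.
brain_params = 1e15
data_in_tokens = 1e15

data_in_passes = tokens_to_forward_passes(data_in_tokens)
print(f"tokens: 10^{math.log10(data_in_tokens):.0f}")
print(f"forward passes: 10^{math.log10(data_in_passes):.0f}")
```

The quantity of training is the same in both cases; only the label on the data axis changes.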
Holy shit, mind blown! Then… how are the scaling laws useful at all? I thought the whole point was to tell you how to divide your compute between… Oh, I see. The recommendations for how to divide up your compute would be the same regardless of which definition of data we used. I guess this suggests that it would be most convenient to define data as “how long you run the model during training” (which in turn is maybe “how many times the average parameter of the model is activated during training”), because that way we can just multiply parameter count by data to get our total compute cost. Or maybe instead we should do what Ajeya does: define data as the number of updates to the model * the batch size, and then calculate compute as data * “horizon length.”
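A rough sketch of why the compute recommendations do not depend on the unit chosen for data: if one "data point" is made 1e3 times bigger, there are 1e3 times fewer of them, but each costs 1e3 times as much to process. The function `total_compute` and its cost parameter are hypothetical, and the units are arbitrary; this is not Ajeya's actual model.

```python
import math

def total_compute(params: float, data_points: float,
                  cost_per_param_per_point: float = 1.0) -> float:
    """Compute in arbitrary units: params x data points x cost of one
    data point per parameter (all illustrative)."""
    return params * data_points * cost_per_param_per_point

params = 1e15

# Token definition: 1e15 data points, each costing 1 unit per parameter.
compute_tokens = total_compute(params, 1e15, cost_per_param_per_point=1.0)

# Forward-pass definition: 1e3 times fewer data points, and (by assumption)
# each one is 1e3 times as expensive because it covers ~1e3 tokens.
compute_passes = total_compute(params, 1e12, cost_per_param_per_point=1e3)

print(math.isclose(compute_tokens, compute_passes))
```

Choosing the unit so that the cost factor is 1 is what makes "parameter count times data" equal total compute directly.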
I’m very interested to hear your thoughts on Ajeya’s methodology. Is my sketch of it above accurate? Do you agree it’s a good methodology? Does it indeed imply (in conjunction with the scaling laws) that a model with 10^15 params should need 10^15 data points to train to a performance level that you couldn’t have got more easily with a smaller model—regardless of what the horizon length is, or what your training environment is, or what the task is?
...
As for the broader point, what do you think of the Carlsmith report? The figure given in the conclusion seems to give some absurdly extreme but reasonably certain upper and lower bounds. And I think the conclusions we draw from them are already drawn in Ajeya’s report, because she includes uncertainty about this in her model. I suppose you could just redo her model but with even more variance… that would probably make her timelines shorter, funnily enough!
Update: According to this, the human brain actually is getting ~10^7 bits of data every second, although the highest level of conscious awareness is only processing ~50. So insofar as we go with the “tokens” definition, it does seem that the human brain is processing plenty of tokens for its parameter count: ~10^16, in fact, over the course of its lifetime (which implies a lifetime of roughly 10^9 seconds). More than enough! And insofar as we go with the “single pass through the network” definition, which would mean we are looking for about 10^12… then we get a small discrepancy: the maximum firing rate of neurons is 250 to 1000 times per second, which means 10^11.5 or so passes over a lifetime… actually, this more or less checks out, I’d say, assuming it’s the max rate that matters and not the average rate (the average rate is about once per second).
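The back-of-the-envelope arithmetic here can be checked with a short sketch. The 30-year lifetime is an assumption picked because it comes to about 10^9 seconds, which is what the 10^16 figure implies; the rates are the ones quoted above, and the helper names are illustrative.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600
LIFETIME_YEARS = 30  # assumption: roughly 1e9 seconds of experience

def lifetime_count(rate_per_second: float,
                   years: float = LIFETIME_YEARS) -> float:
    """Total events over a lifetime at a constant rate per second."""
    return rate_per_second * years * SECONDS_PER_YEAR

def oom(x: float) -> float:
    """Order of magnitude, i.e. log base 10."""
    return math.log10(x)

estimates = {
    "input bits (1e7 per second)": lifetime_count(1e7),
    "passes at max firing rate, low end (250 per second)": lifetime_count(250),
    "passes at max firing rate, high end (1000 per second)": lifetime_count(1000),
    "passes at average firing rate (1 per second)": lifetime_count(1),
}

for label, value in estimates.items():
    print(f"{label}: ~10^{oom(value):.1f}")
```

Under these assumptions the max-rate estimates land within about half an order of magnitude of 10^12, while the average-rate estimate falls roughly three orders of magnitude short.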
Does this mean that it may not actually be true that humans are several OOMs more data-efficient than ANNs? Maybe the apparent data-efficiency advantage is really mostly just the result of transfer learning from vast previous life experience. GPT-3 can likewise “few-shot learn” totally new tasks and “fine-tune” on relatively small amounts of data (3+ OOMs less, according to the transfer laws paper!), but really what’s going on there is just transfer learning from its vast pre-training experience.