I would point out that your calculations are based on the incident data our senses pick up, whereas what we learn is based on the information received by our brain. Almost all of the incident data is thrown away much closer to the source. This works because some of the first things we learn (or have hard-coded) are which data our sensory organs can disregard, since there’s so much redundancy. The data rates of our optic and auditory nerves are only about 0.1% (to an OOM) of the data rates you list here (I couldn’t find a good estimate for tactile, but it seems like it should be much lower than visual and auditory). I’m not sure how much other preprocessing and discarding of data happens elsewhere, but it doesn’t take many more steps to close the remaining 1.5 OOM gap. (1.3 OOM if you subtract the ~10 years spent asleep.)
OTOH I don’t think this necessarily undermines your overall conclusion? We’re not training LLMs on the same tasks that brains are being trained on, and we’re not measuring them by the same metrics, so I’m not sure how to make the comparison fairly.
But then again, if I were to try to read everything in the GPT-4 training data, it would take me millennia of continuous focus and effort at my reading speed, just for data input, before doing any other thinking about the data. Adding the option of using my other senses (voice-to-text, braille if I knew it, some sort of olfactory encoding if it existed) wouldn’t help, because I can’t actually process those streams in parallel. And I’m only capable of using a tiny fraction of my sense data for language: it’s not like I can read outside the fovea, or read overlapping texts in colors tuned to each cone type, or listen to a hundred conversations at once if they’re at different frequencies.
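(Rough back-of-envelope for that timescale, assuming the ~1T-token GPT-4 figure used later in this thread, ~0.75 English words per token, and a ~250 wpm reading speed, all round-number assumptions:)

```python
tokens = 1e12           # GPT-4 training tokens (figure used in this thread)
words_per_token = 0.75  # rough English average (assumption)
wpm = 250               # typical adult reading speed (assumption)

minutes = tokens * words_per_token / wpm
years = minutes / (60 * 24 * 365)
print(f"~{years:,.0f} years of continuous reading")  # ~5,700 years
```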
You mention “I would point out that your calculations are based on the incident data our senses pick up, whereas what we learn is based on the information received by our brain. Almost all of the incident data is thrown away much closer to the source.”
Wouldn’t this be similar to how a Neural Network “disregards” training data that it has already seen? i.e. if it has already learned that pattern, the loss is already near its floor, so there’s (almost) no gradient and the weights barely move. Maybe there’s another mechanism that we’re missing in current neural nets’ online training, one that would increase training efficiency by recognizing redundant data and skipping the feedforward pass entirely. Tesla does this in an engineered manner: they throw away most data at the source and only learn on “surprise/interventions”, i.e. data that generates a gradient.
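As a minimal sketch of what that gating might look like (names and threshold are illustrative, not Tesla’s actual system; and note the cheap “surprise” check here still costs a no-grad forward pass, so truly skipping the feedforward would need a filter even closer to the sensor):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
SURPRISE_THRESHOLD = 0.05  # illustrative cutoff; a real system would tune this

def online_step(x, y):
    # Cheap redundancy check: forward pass only, no autograd bookkeeping.
    with torch.no_grad():
        surprise = loss_fn(model(x), y).item()
    if surprise < SURPRISE_THRESHOLD:
        return None  # already-learned pattern: skip backprop and the update
    # "Surprising" sample: pay for the full forward/backward/update.
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    return loss.item()

# e.g. online_step(torch.randn(8, 16), torch.randn(8, 1))
```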
I don’t really get what you mean by “Not sure how much other preprocessing and discarding of data happens elsewhere, but it doesn’t take that many more steps to close the remaining 1.5 OOMs gap.” Are you saying that the real calculations are closer to 1.5 orders of magnitude of what I calculated or 1.5% of what I calculated?
“Wouldn’t this be similar to how a Neural Network ‘disregards’ training data that it has already seen?”
I don’t know how that’s done, sorry. Does it literally throw away the data without using it for anything whatsoever (and does it do this with on the order of 99.9% of the training data set)? Or does it process the data, but because the data is redundant it has little to no effect on the model weights? I’m talking about the former, since the vast majority of our visual data never makes it from the retina to the optic nerve. The latter would be more like how looking at my bedroom wall yet again has little to no effect on my understanding of any aspect of the world.
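To make the former mechanism concrete, here’s a toy sketch of a fixed front end that discards ~99.9% of the raw signal before anything downstream ever sees it (the dimensions and pooling factor are made up for illustration; a real retina does something far more structured than average pooling):

```python
import numpy as np

# "Incident" data: one raw 1024x2048 grayscale frame.
frame = np.random.rand(1024, 2048)

# Retina-like fixed front end: 32x32 block averaging. Downstream learning
# only ever receives the 32x64 summary; the other ~99.9% of the pixels are
# never processed for training at all.
pooled = frame.reshape(32, 32, 64, 32).mean(axis=(1, 3))

print(f"fraction passed on: {pooled.size / frame.size:.4%}")  # ~0.0977%
```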
And to your second point, yeah, I was pretty unclear, sorry. I meant: your original calculation was that a human at age 30 has ~31,728T tokens worth of data, compared to ~1T for GPT-4. The human has 31,728 times as much, and log(31,728) ≈ 4.5, meaning the human has 4.5 OOM more training data. But if I’m right that you should cut the human training data down by ~1000x, because that much is thrown away before it gets processed in the brain at all, then we’re left with a human at age 30 having only 31.728x as much. log(31.728) ≈ 1.5, i.e. the human has 1.5 OOM more training data. The rest of that comment was me indicating that this is just how much data gets to the brain in any form, not how much is actually being processed for training purposes.
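Spelling the arithmetic out (same numbers as above; the 2/3 factor is the waking-time adjustment from my earlier ~10-years-asleep aside):

```python
import math

human_tokens = 31_728e12  # ~31,728T token-equivalents by age 30 (your calc)
gpt4_tokens = 1e12        # ~1T tokens for GPT-4 (figure used in this thread)

raw_gap = math.log10(human_tokens / gpt4_tokens)                         # ≈ 4.5 OOM
after_senses = math.log10((human_tokens / 1000) / gpt4_tokens)           # ≈ 1.5 OOM
after_sleep = math.log10((human_tokens / 1000) * (20/30) / gpt4_tokens)  # ≈ 1.3 OOM

print(f"{raw_gap:.1f} / {after_senses:.1f} / {after_sleep:.1f} OOM")
```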