2. Humans “feel” better than even SOTA language models, but need less training data than those models, even though right now the only way to improve the models is through more training data. What am I supposed to conclude from this? Are humans running on such a different paradigm that none of this matters? Or is it just that humans are better at common-sense language tasks, but worse at token-prediction language tasks, in some way where the tails come apart once language models get good enough?
Why do we say that we need less training data? Every instant of our existence is a multisensory data point, starting before we’ve even exited the womb. We spend months, arguably years, hardly capable of anything at all, yet still taking in and retaining data. Unsupervised and mostly redundant, sure, but certainly not less than a curated collection of Internet text. By the time we’re teaching a child to say “dog” for the first time, they’ve probably experienced millions of fragments of data on creatures of various limb counts, hair and fur types, sizes, sounds, and smells; so they’re already effectively pretrained on animals before we first provide a supervised connection between the sound “dog” and the sight of a four-limbed hairy creature with long ears on a leash.
I believe that humans exceed the amount of data ML models are trained on by multiple orders of magnitude by the time we’re adults, even if our data is extremely messy.
I did some calculations with a bunch of assumptions and simplifications, but here’s a high-end, back-of-the-envelope estimate of the data and “tokens” a 30-year-old human would have “trained” on:
Visual data: 130 million photoreceptor cells firing at 10 Hz = 1.3 Gbit/s = 162.5 MB/s; over 30 years (approx. 946,080,000 seconds) = 153.74 petabytes
Auditory data: humans can hear frequencies up to 20,000 Hz, and high-quality audio is sampled at 44.1 kHz, satisfying the Nyquist–Shannon sampling theorem. Assuming 16-bit samples (CD quality) × 2 channels (stereo) = 1.41 Mbit/s = 0.18 MB/s; over 30 years = 0.167 petabytes
Tactile data: 4 million touch receptors each providing ~10 bits/s (assuming they account for temperature, pressure, pain, hair movement, and vibration) = 5 MB/s; over 30 years = 4.73 petabytes
Olfactory data: we can detect up to 1 trillion smells; assuming we process 1 smell every second and each smell is represented as its own piece of data, i.e. log2(1 trillion) ≈ 40 bits/s = 0.000005 MB/s; over 30 years = 0.0000047 petabytes
Taste data: 10,000 receptors, assuming a unique identifier for each basic taste (sweet, sour, salty, bitter, and umami): log2(5) ≈ 2.3 bits, rounded up to 3 = 30 kbit/s = 0.00375 MB/s; over 30 years = 0.0035 petabytes
This amounts to 153.74 + 0.167 + 4.73 + 0.0000047 + 0.0035 ≈ 158.64 petabytes. Assuming 5 bytes per token (i.e. 5 characters), that’s about 31,728 trillion tokens.
This is of course a high estimate, and most of this data is clearly highly compressible, but I wanted a rough sense of the upper bound.
Here’s the Google sheet if anyone wants to copy it or contribute.
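If anyone wants to sanity-check the arithmetic without opening the sheet, here’s a minimal Python sketch of the same estimate; the per-sense rates are just the assumptions listed above (including the ~10 bits/s per touch receptor), not measured values:

```python
# Back-of-the-envelope sketch of the sensory "training data" estimate above.
# Every rate here is an assumption from the comment, not a measured value.

SECONDS_30_YEARS = 30 * 365 * 24 * 60 * 60  # approx. 946,080,000 s

# Bits per second for each modality.
rates_bps = {
    "visual":    130e6 * 10,       # 130M photoreceptors firing at 10 Hz
    "auditory":  44_100 * 16 * 2,  # 44.1 kHz * 16-bit samples * 2 channels
    "tactile":   4e6 * 10,         # 4M touch receptors at ~10 bits/s each
    "olfactory": 40,               # log2(1e12 smells) ~= 40 bits, one smell/s
    "taste":     10_000 * 3,       # 10k receptors * 3 bits (ceil of log2(5))
}

PB = 1e15  # bytes per petabyte

total_bytes = 0.0
for sense, bps in rates_bps.items():
    nbytes = bps / 8 * SECONDS_30_YEARS
    total_bytes += nbytes
    print(f"{sense:9s}: {nbytes / PB:.7f} PB")

print(f"total    : {total_bytes / PB:.2f} PB")        # ~158.64 PB
print(f"tokens   : {total_bytes / 5 / 1e12:,.0f} T")  # ~31,728 T tokens at 5 bytes/token
```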
There’s a billion seconds in 30 years. Chinchilla was trained on 1.4 trillion tokens. So for a human adult to have as much data as Chinchilla, we would have to process the equivalent of ~1,400 tokens per second. I think that’s something like 2 kilobytes per second.
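As a rough check in code (the 4 bytes per token used for the kB/s conversion is just a common rule of thumb, not a number from the Chinchilla paper):

```python
# Sanity check of the tokens-per-second comparison above.
SECONDS_30_YEARS = 30 * 365 * 24 * 60 * 60  # ~9.5e8, i.e. roughly a billion
CHINCHILLA_TOKENS = 1.4e12                  # tokens Chinchilla was trained on

tokens_per_second = CHINCHILLA_TOKENS / SECONDS_30_YEARS
print(f"{tokens_per_second:.0f} tokens/s")  # ~1,480 (≈1,400 with the rounded "billion seconds")

# Bytes/s depends on the assumed bytes per token; at the common
# ~4-characters-per-token rule of thumb this is closer to ~6 kB/s
# than 2 kB/s, but the order of magnitude is the same.
print(f"{tokens_per_second * 4 / 1000:.1f} kB/s")
```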
Inputs to the human brain are probably dominated by vision. I’m not sure how many bytes per second we see, but I don’t think it’s many orders of magnitude higher than 2 kB.
That depends a lot on how you count. A quick Googling suggests that the optic nerve has 1.7 million nerve fibers.
If you assume a neuron firing rate of 20 Hz, that gives you 34 MB per second.
(If 1 firing = 1 bit, that should be 34 megabit ~= 4 megabyte.)
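A minimal sketch of that conversion, under the same 1-bit-per-firing assumption:

```python
# Optic-nerve bandwidth under the assumptions in this thread.
FIBERS = 1.7e6       # optic nerve fibers (the quick-Google figure above)
FIRING_HZ = 20       # assumed average firing rate
BITS_PER_FIRING = 1  # assumption: each firing carries one bit

bits_per_s = FIBERS * FIRING_HZ * BITS_PER_FIRING
print(f"{bits_per_s / 1e6:.0f} Mbit/s")    # 34 Mbit/s, not 34 MB/s
print(f"{bits_per_s / 8 / 1e6:.2f} MB/s")  # ~4.25 MB/s
```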
This random article (which I haven’t fact-checked in the least) claims a bandwidth of 8.75 megabit/s ~= 1 megabyte/s. So that’s like 2.5 OOMs higher than the number I claimed for Chinchilla. So yeah, it does seem like humans get more raw data.
(But I still suspect that Chinchilla gets more data if you adjust for (un)interestingness. Where totally random data and easily predictable/compressible data are uninteresting, and data that is hard-but-possible to predict/compress is interesting.)
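Checking that OOM comparison, taking the article’s 8.75 Mbit/s and the ~2 kB/s Chinchilla-equivalent rate upthread at face value:

```python
import math

# ~1 MB/s of visual input vs ~2 kB/s of Chinchilla-equivalent text.
visual_Bps = 8.75e6 / 8  # the linked article's 8.75 Mbit/s, in bytes/s
chinchilla_Bps = 2e3     # the ~2 kB/s figure claimed upthread

print(f"{math.log10(visual_Bps / chinchilla_Bps):.1f} OOMs")  # ~2.7
```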