Your reasoning here relies on the assumption that the learning mostly takes place during the individual organism’s lifetime. But I think it’s widely accepted that brains are not “blank slates” at birth, but contain a significant amount of information, akin to a pre-trained neural network. Thus, if we consider evolution as the training process, we might reach the opposite conclusion: data quantity and training compute are extremely high, while parameter count (~brain size) and brain compute are restricted and selected against.
Much depends on what you mean by “learning” and “mostly”, but the evidence for some form of blank slate is overwhelming. Firstly, most of the bits in the genome must code for cellular machinery, and even then the total genome bits are absolutely tiny compared to brain synaptic bits. Then we have vast accumulating evidence from DL that nearly all the bits come from learning/experience, that optimal model bit complexity is proportional to dataset size (which, not coincidentally, is roughly on the order of 1e15 bits for humans: ~1e9 seconds * ~1e6 bits/s), and that the tiny number of bits needed to specify the architecture and learning hyperparameters are simply a prior which can be overcome with more data. And there is much more.
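For concreteness, here is the back-of-envelope arithmetic behind that ~1e15 figure as I read it; a minimal sketch, where the ~30 years of learning time and the ~1e6 bits/s effective sensory bandwidth are the assumptions being made, not measured constants:

```python
# Rough lifetime-data estimate: seconds of experience * effective sensory bandwidth.
# Both inputs are assumptions from the comment above, not measured values.
seconds_per_year = 3.15e7
lifetime_seconds = 30 * seconds_per_year        # ~1e9 s of learning time
sensory_bits_per_second = 1e6                   # assumed effective input bandwidth
lifetime_bits = lifetime_seconds * sensory_bits_per_second
print(f"{lifetime_bits:.1e} bits")              # ~9.4e14, i.e. on the order of 1e15
```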
If you think the human dataset size is 1e15 bits because you are counting each second as a million bits, then how is it that you think humans are vastly more data-efficient than ANNs? The human “pre-training”, i.e. their childhood, is OOMs bigger than the pre-training for even the largest language models of today.
(How many bits are in a token, for GPT-3? Idk, probably at most 20?)
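One rough way to bound that: a token drawn from a ~50k-entry vocabulary carries at most log2(vocab) bits. The sketch below uses GPT-3’s 50,257-token BPE vocabulary and the commonly cited ~300B training tokens; both are public round figures, and the comparison is only order-of-magnitude.

```python
import math

# Upper bound on information per token: log2 of the vocabulary size.
gpt3_vocab_size = 50_257                     # GPT-3's BPE vocabulary (public figure)
bits_per_token = math.log2(gpt3_vocab_size)
print(f"{bits_per_token:.1f} bits/token")    # ~15.6, consistent with "at most 20"

# Very rough comparison of pretraining sizes in raw bits.
gpt3_tokens = 300e9                          # ~300B tokens, the commonly cited figure
gpt3_bits = gpt3_tokens * bits_per_token     # ~4.7e12 bits
human_sensory_bits = 1e15                    # the lifetime estimate from upthread
print(f"ratio: {human_sensory_bits / gpt3_bits:.0f}x")   # ~200x, i.e. ~2 OOMs
```

So on this raw-bit accounting the human sensory stream is a couple of OOMs larger than GPT-3’s corpus, though (as the reply below argues) it is also far more redundant per bit.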
I admit that I am confused about this stuff; this isn’t a “gotcha” but a genuine question.
TLDR: Humans have a radically different curriculum training tech, which we have perfected over literally millennia: it starts with a few years of pretraining on about 1e15 bits of lower-value sensory data, and then gradually shifts to training for another few decades on about 1e10 bits of higher-value token/text data.
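As a rough consistency check on that ~1e10-bit text figure (nothing more), converting it to a daily word budget with an assumed ~15-16 bits per token and ~1 token per word gives numbers in a plausible range:

```python
import math

# Does ~1e10 bits of token/text data over a few decades look plausible?
text_bits = 1e10
bits_per_token = math.log2(50_000)           # ~15.6 bits for a ~50k vocabulary (assumption)
tokens = text_bits / bits_per_token          # ~6.4e8 tokens
words_per_day = tokens / (30 * 365)          # ~30 years, ~1 token per word (crude)
print(f"~{tokens:.1e} tokens, ~{words_per_day:.0f} words/day")
# ~6.4e8 tokens and roughly 60k words/day of heard + read language: high, but
# within an order of magnitude of typical estimates of daily language exposure.
```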
It is pretty likely that part of our apparent token/word data efficiency at abstract tasks does come from our everyday physics-sim capabilities, which leverage the lower-level vision/sensory modules trained on the larger ~1e15 bits (many linguists/philosophers were saying this long ago; the whole symbol grounding problem). And I agree with that. I suspect that is not the only source of our data efficiency, but yes, I’m reasonably confident that AGI will require much more human-like curriculum training (with vision/sensory ‘pretraining’).
On the other hand, we also have examples like Helen Keller that place some rough limits on that transfer effect, and we have independent good reasons to believe the low-level vision data is much more redundant (in part because the text stream is a compressed summary of what was originally low-level vision/sensory data!).
Looking at it another way: this is the crux of human vs animal intelligence. An animal with a similar lifespan and brain size (which are naturally correlated due to scaling laws!) would only have the 1e15 bits of sensory training data. Humans also curriculum-train on 1e10 bits of a carefully curated subset of the total ~1e12 bits of accumulated human text-symbolic knowledge, which is itself a compression of the ~1e26 bits of sensory data from all humans who have ever lived. Thus the intelligence of individual humans scales a bit with the total size of humanity, whereas it is basically constant for animals. Combine that with the exponential growth of the human population and you get the observed hyperexponential trajectory leading to a singularity around 2047. (prior LW discussion)
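To make that accounting explicit, the chain below just multiplies out the comment’s own round numbers (the ~1e11 humans-ever figure is the standard rough demographic estimate); none of these are independent measurements.

```python
# Chaining the round numbers from the paragraph above (not independent estimates).
bits_per_lifetime = 1e15                  # sensory data per individual, from upthread
humans_ever = 1e11                        # ~100 billion people ever (rough demographic figure)
all_sensory_bits = bits_per_lifetime * humans_ever       # ~1e26 bits
accumulated_text_bits = 1e12              # all human text-symbolic knowledge (comment's figure)
curriculum_bits = 1e10                    # the curated subset one person actually trains on
print(f"sensory -> text compression: {all_sensory_bits / accumulated_text_bits:.0e}x")  # 1e+14x
print(f"curriculum share of all text: {curriculum_bits / accumulated_text_bits:.0%}")   # 1%
# An animal with a similar brain and lifespan only gets the 1e15-bit sensory stream;
# a human additionally trains on the 1e10-bit distillation of the 1e26-bit total.
```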
So the intelligence explosion people like Eliezer Yudkowsky and Luke Muehlhauser were possibly more right than they knew?
I’ll give them credit for predicting that the future would be much weirder and more unstable, à la the Singularity, long before Open Philanthropy saw the problem.
It also means stakes on the order of Pascal’s mugging are fairly likely this century, and we do live in the hinge of history.
I kinda hate summarizing EY (even EY-circa-2008) into a paragraph, but EY’s version of the intelligence explosion or singularity was focused heavily on recursively self-improving AI that could quickly recode itself in ways humans presumably could not, and was influenced by/associated with a pessimistic evolved-modularity view of the brain that hasn’t aged well. Rapid takeoff, inefficient brains, evolved modularity, etc. all tie together and self-reinforce.
What has aged much better is the more systems-holistic singularity (Moravec/Kurzweil, John Smart, etc.), which credits (correctly) human intelligence to culture/language (human brains are just bog-standard primate brains scaled up ~3x). It is associated with a softer takeoff, as the AI advantage is mostly about allowing further exponential expansion of the size/population/complexity of the overall human memetic/cultural system. In this view recursive self-improvement is just baked into the acceleration rather than some new specific innovation of future AI, and AI itself is viewed as a continuation of humanity (little difference between de novo AI and uploads), rather than some new alien thing.