One major update from the Chinchilla paper against the NN timelines that this post doesn’t capture (inspired by this comment by Rohin):
Based on Kaplan scaling laws, we might’ve expected that raw parameter count was the best predictor of capabilities. Chinchilla scaling laws introduced a new component, data quantity, that was not incorporated in the original report.
Chinchilla scaling laws provide the compute-optimal trade-off between datapoints and parameters, but not the cost-optimal trade-off (assuming that costs come from both using more compute and observing more datapoints). In biological systems, the marginal cost of doubling the amount of data is very high, since that requires doubling the organism’s lifespan or doubling its neuron throughput, which are basically hard constraints. This means that human brains may be very far from “compute optimal” in the zero-datapoint-cost limit suggested by Chinchilla, implying ANN models much smaller than brain size (estimated at ~10T parameters) may achieve human-level performance given compute-optimal quantities of data.
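For concreteness, here is a minimal sketch of what “compute-optimal” means under the Chinchilla parametric loss fit L(N, D) = E + A/N^alpha + B/D^beta from Hoffmann et al. (2022), using the standard C ≈ 6ND training-FLOPs approximation. The constants are the paper’s fitted values for one transformer family on one dataset, and the “brain-like” point (1e13 parameters, 1e10 language tokens) is a rough illustrative assumption, not a measurement; under that fit, the compute-optimal split of the same FLOPs uses a far smaller model and far more data.

```python
# Minimal sketch of compute-optimal allocation under the Chinchilla parametric fit.
# All constants are the published fit for one transformer family on one dataset;
# the "brain-like" point below is a loose assumption for illustration only.
A, B, E = 406.4, 410.7, 1.69        # fitted coefficients
alpha, beta = 0.34, 0.28            # fitted exponents

def loss(N, D):
    """Predicted pre-training loss for N parameters and D training tokens."""
    return E + A / N**alpha + B / D**beta

def compute_optimal(C):
    """Minimize loss(N, D) subject to 6*N*D = C (substitute D = C/(6N), set derivative to 0)."""
    N = (alpha * A / (beta * B)) ** (1 / (alpha + beta)) * (C / 6) ** (beta / (alpha + beta))
    return N, C / (6 * N)

# A data-starved "brain-like" point vs. the compute-optimal split of the same FLOPs.
N_brain, D_brain = 1e13, 1e10
C = 6 * N_brain * D_brain
N_opt, D_opt = compute_optimal(C)
print(f"data-starved   : N={N_brain:.1e}, D={D_brain:.1e}, loss={loss(N_brain, D_brain):.2f}")
print(f"compute-optimal: N={N_opt:.1e}, D={D_opt:.1e}, loss={loss(N_opt, D_opt):.2f}")
```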
In other words, the big takeaway is that we should update away from human-level FLOPS as a good bio-anchor independent of the number of training datapoints, since we have reason to believe that human brains face other constraints which suboptimally inflate the number of FLOPS brains use to attain a given level of performance.
And in particular we should update towards below-human-level FLOPS.
No—at least not for these reasons—see my longer reply lower down, but what probably matters most is total search volume (model size * training time), which is basically FLOPs/memops. A smaller model can train longer to get up to the same capabilities for roughly the same total compute budget, but for AGI the faster-learning model is more intelligent in any useful sense. And of course the human brain is probably pretty close to practical limits for equivalent-FLOPs learning efficiency.
To a first-order approximation, total FLOPs predicts ANN/BNN capabilities quite well. GPT-3's training run was ~3e23 FLOPs; a 30-year-old human brain is roughly 1e23 FLOPs equivalent (~1e9 seconds * 1e14 FLOPs/s). GPT-3 is only really equivalent to, say, 10% of the human brain at best (the language-related cortices), but naturally the brain is still at least an OOM more FLOPs-efficient.
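For reference, the arithmetic behind those two figures, as a quick sketch: the 6*N*D rule is the standard transformer training-FLOPs estimate, and the 1e14 FLOPs/s brain-equivalent rate is the comment's own assumption.

```python
# Back-of-envelope check of the two figures above (rough assumptions throughout).
gpt3_params = 175e9
gpt3_tokens = 300e9
gpt3_flops  = 6 * gpt3_params * gpt3_tokens       # ~3.2e23 FLOPs, via the 6*N*D estimate

brain_flops_per_s = 1e14                          # assumed FLOPs-equivalent rate of the brain
seconds_30_years  = 30 * 365 * 24 * 3600          # ~1e9 seconds
brain_flops       = brain_flops_per_s * seconds_30_years

print(f"GPT-3 training  : {gpt3_flops:.1e} FLOPs")    # ~3e23
print(f"brain, 30 years : {brain_flops:.1e} FLOPs")   # ~1e23
```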
On the other hand, humans are good at active learning — selecting the datapoints which lead to the most efficient progress. Relative to Chinchilla scaling laws which assume no active learning, humans may be using their computation far more efficiently.
Gradient descent is still a form of search, and what matters most is the total search volume. In the overparameterized regime (which ANNs are now entering and BNNs swim in), performance (assuming it is not limited by data quality) is roughly predicted by (model size * training time). It doesn’t matter greatly whether you train a model twice as large for half as long or vice versa—in either case it’s the total search volume that matters, because in the overparameterized regime you are searching for needles in the circuit-space haystack.
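As a quick sanity check of the "twice as large for half as long" claim, evaluating the same Chinchilla parametric fit used in the sketch above near a compute-optimal point gives nearly identical predicted losses for 2x swaps between parameters and tokens (again, only within that fit, which later comments caution against over-trusting).

```python
# Swap model size against tokens at fixed N*D (hence fixed ~6*N*D FLOPs) and compare
# predicted losses under the Chinchilla parametric fit (illustrative only).
A, B, E, a, b = 406.4, 410.7, 1.69, 0.34, 0.28
loss = lambda N, D: E + A / N**a + B / D**b

N0, D0 = 1.2e11, 1.4e13            # roughly the compute-optimal point for ~1e25 FLOPs
for k in (0.5, 1.0, 2.0):          # k=2: twice the params, half the tokens, etc.
    print(f"N={k*N0:.1e}, D={D0/k:.1e} -> predicted loss {loss(k*N0, D0/k):.3f}")
```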
However, human intelligence (at the high end) is to a first and second approximation simply learning speed and thus data efficiency. Even if the smaller brain/model trained for much longer has equivalent capability now, the larger model/brain still learns faster given the same new data, and is thus more intelligent in the way more relevant for human level AGI. We have vastly more ability to scale compute than we can scale high quality training data.
It’s dangerous to infer much from the ‘Chinchilla scaling laws’: humans exceed NLM performance on downstream tasks using only a few billion tokens’ worth of data, i.e. 2 or more OOMs less. These internet-scale datasets are mostly garbage. Human brains are curriculum-trained on a much higher-quality, quality-sorted multimodal dataset, which almost certainly has very different scaling than the random/unsorted order used in Chinchilla. A vastly larger mind/model could probably learn as well using even an OOM less data.
The only real conclusion from Chinchilla scaling is that for that particular species of transformer NLM, trained on that particular internet-scale dataset, the optimal token/param ratio is about 30x. But that doesn’t even mean you’d get the same scaling curve or the same optimal token/param ratio for a different architecture on a different dataset with different curation.
Your reasoning here relies on the assumption that the learning mostly takes place during the individual organism’s lifetime. But I think it’s widely accepted that brains are not “blank slates” at birth, but contain a significant amount of information, akin to a pre-trained neural network. Thus, if we consider evolution as the training process, we might reach the opposite conclusion: data quantity and training compute are extremely high, while parameter count (~brain size) and brain compute are restricted and selected against.
Much depends on what you mean by “learning” and “mostly”, but the evidence for some form of blank slate is overwhelming. Firstly, most of the bits in the genome must code for cellular machinery, and even then the total genome bits are absolutely tiny compared to brain synaptic bits. Then we have vast accumulating evidence from DL that nearly all the bits come from learning/experience, that optimal model bit complexity is proportional to dataset size (which, not coincidentally, is roughly on the order of 1e15 bits for humans: 1e9 seconds * 1e6 bits/s), and that the tiny number of bits needed to specify architecture and learning hyperparams are simply a prior which can be overcome with more data. And there is much more.
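To make the bit-counting explicit, here is the rough arithmetic behind that comparison; the synapse count and bits-per-synapse are loose assumptions, and the lifetime figure just restates the comment's 1e9 s * 1e6 bits/s.

```python
# Rough orders of magnitude (all inputs are coarse assumptions, shown only to
# make the comparison explicit).
genome_bits   = 3.1e9 * 2      # ~3.1e9 base pairs x 2 bits per base pair  -> ~6e9 bits
synaptic_bits = 1e14 * 5       # ~1e14 synapses x ~5 bits each (assumed)   -> ~5e14 bits
lifetime_bits = 1e9 * 1e6      # ~1e9 waking seconds x ~1e6 bits/s         -> ~1e15 bits

print(f"genome          : ~{genome_bits:.0e} bits")
print(f"synaptic storage: ~{synaptic_bits:.0e} bits")
print(f"lifetime input  : ~{lifetime_bits:.0e} bits")
```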
If you think the human dataset size is 1e15 bits because you are counting each second as a million bits, then how is it that you think humans are vastly more data-efficient than ANNs? The human “pre-training”, i.e. childhood, is OOMs bigger than the pre-training of even the largest language models of today.
(How many bits are in a token, for GPT-3? Idk, probably at most 20?)
I admit that I am confused about this stuff, this isn’t a “gotcha” but a genuine question.
TLDR: Humans have a radically different curriculum-training tech, perfected over literally millennia, which starts with a few years of pretraining on about 1e15 bits of lower-value sensory data and then gradually shifts to training for another few decades on about 1e10 bits of higher-value token/text data.
It is pretty likely that part of our apparent token/word data efficiency at abstract tasks does come from our everyday physics-sim capabilities, which leverage the lower-level vision/sensory modules trained on the larger ~1e15 bits (and many linguists/philosophers were saying this long ago—the whole symbol grounding problem). And I agree with that. I suspect that is not the only source of our data efficiency, but yes, I’m reasonably confident that AGI will require a much more human-like curriculum training (with vision/sensory ‘pretraining’).
On the other hand, we also have examples like Helen Keller which place some rough limits on that transfer effect, and we have independent good reasons to believe the low-level vision data is much more redundant (in part because the text stream is a compressed summary of what was originally low-level vision/sensory data!).
Looking at it another way: this is the crux of human vs animal intelligence. An animal with similar lifespan and brain size (which are naturally correlated due to scaling laws!) would only have the ~1e15 bits of sensory training data. Humans also curriculum-train on ~1e10 bits of a carefully curated subset of the total ~1e12 bits of accumulated human text-symbolic knowledge, which is itself a compression of the ~1e26 bits of sensory data from all humans who have ever lived. Thus the intelligence of individual humans scales a bit with the total size of humanity, whereas it’s basically just constant for animals. Combine that with the exponential growth of human population and you get the observed hyperexponential trajectory leading to a singularity around 2047. (prior LW discussion)
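A quick sanity check of that bookkeeping, with every input a rough assumption: the ~15 bits/token figure is roughly log2 of a 50k BPE vocabulary, and ~1e11 is the usual estimate of the number of humans who have ever lived.

```python
# Rough bookkeeping for the per-human text budget and the all-human sensory total.
bits_per_token      = 15            # ~log2(50k vocab), an assumption
lifetime_tokens     = 1e9           # rough lifetime reading/listening budget
text_bits_per_human = bits_per_token * lifetime_tokens     # ~1.5e10, cf. the ~1e10 above

humans_ever       = 1e11            # ~100 billion humans ever lived (rough)
sensory_bits_each = 1e15            # per-lifetime sensory bits, as estimated earlier
all_human_sensory = humans_ever * sensory_bits_each         # ~1e26, cf. the figure above

print(f"text bits per human   : ~{text_bits_per_human:.0e}")
print(f"all-human sensory bits: ~{all_human_sensory:.0e}")
```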
So the intelligence explosion people like Eliezer Yudkowsky and Luke Muehlhauser were possibly more right than they knew?
I’ll give them credit for predicting that the future would be much weirder and more unstable, a la the Singularity, long before Open Philanthropy saw the problem.
It also means stakes on the order of Pascal’s mugging are fairly likely this century, and we do live in the hinge of history.
I kinda hate summarizing EY (even EY-circa-2008) into a paragraph, but EY’s version of the intelligence explosion or singularity was focused heavily on recursively self-improving AI that could quickly recode itself in ways humans presumably could not, and was influenced by and associated with a pessimistic evolved-modularity view of the brain that hasn’t aged well. Rapid takeoff, inefficient brains, evolved modularity, etc. all tie together and self-reinforce.
What has aged much better is the more systems-holistic singularity (Moravec/Kurzweil, John Smart, etc.), which credits (correctly) human intelligence to culture/language (human brains are just bog-standard primate brains scaled up ~3x), and which is associated with a softer takeoff, as the AI advantage is mostly about allowing further exponential expansion of the size/population/complexity of the overall human memetic/cultural system. In this view, recursive self-improvement is just sort of baked into the acceleration rather than some new specific innovation of future AI, and AI itself is viewed as a continuation of humanity (little difference between de novo AI and uploads), rather than some new alien thing.