Let me try again. Maybe this will be clearer.
The paradigm of the brain is online learning. There are a “small” number of adjustable parameters governing how the process is set up, and then each run is long: a billion subjective seconds. And during the run there are a “large” number of adjustable parameters that get adjusted. Almost all the information content arises within a single run.
The paradigm of today’s popular ML approaches is train-then-infer. There are a “large” number of adjustable parameters, which are adjusted over the course of an extremely large number of extremely short runs. Almost all the information content comes from the training process, not within the run. Meanwhile, sometimes people do multiple model-training runs with different hyperparameters—hyperparameters are a “small” number of adjustable parameters that sit outside the gradient-descent training loop.
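To make the contrast concrete, here is a toy sketch (my framing, not anything from the report): the brain-like setup has a couple of knobs and one long run in which internal state keeps getting adjusted, while the ML setup has hyperparameters sitting outside a loop of very many very short episodes. All function names and numbers below are illustrative.

```python
# Toy sketch (my framing, not from the report) of the two paradigms being contrasted.
# The point is only where the "small" vs "large" sets of adjustable parameters live,
# and how long a single run is.
import random

def online_learning_run(setup_knobs, n_steps):
    """Brain-like: a few setup knobs, then ONE long run during which the 'large'
    internal state gets adjusted as experience streams in."""
    estimate = 0.0                                    # stands in for synapse settings
    for _ in range(n_steps):                          # think "a billion subjective seconds"
        observation = random.gauss(5.0, 1.0)          # stream of experience
        estimate += setup_knobs["lr"] * (observation - estimate)
    return estimate                                   # the information content lives here

def train_then_infer(hyperparams, n_episodes):
    """Today's ML: hyperparameters sit outside the loop; a 'large' parameter vector is
    adjusted by gradient descent over an enormous number of short episodes."""
    weight = 0.0                                      # stands in for the model's parameters
    for _ in range(n_episodes):                       # extremely many, extremely short runs
        x = random.gauss(0.0, 1.0)
        grad = (weight * x - 3.0 * x) * x             # gradient of squared error on this episode
        weight -= hyperparams["lr"] * grad
    return weight                                     # frozen after training, deployed for inference

print(online_learning_run({"lr": 0.01}, n_steps=10_000))
print(train_then_infer({"lr": 0.01}, n_episodes=10_000))
```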
I think the appropriate analogy is:
(A) Brain: One (billion-subjective-second) run ↔ ML: One gradient-descent model training
(B) Brain: Adjustable parameters on the genome ↔ ML: Hyperparameters
(C) Brain: Settings of synapses (or potential synapses) in a particular adult ↔ ML: parameter settings of a fully-trained model
This seems to work reasonably well all around: (A) takes a long time and involves a lot of information content in the developed “intelligence”, (B) is a handful of (perhaps human-interpretable) parameters, (C) is the final “intelligence” that you wind up wanting to deploy.
So again I would analogize one run of the online-learning paradigm with one training of today’s popular ML approaches. Then I would try to guess how many runs of online-learning you need, and I would guess 10-100, not based on anything in particular, but you can get a better number by looking into the extent to which people need to play with hyperparameters in their ML training, which is “not much if it’s very important not to”.
Sure, you can do a boil-the-oceans automated hyperparameter search, but the biggest projects have no compute to spare for that. Instead, you sit and think about the hyperparameters, you do smaller-scale studies, you try to carefully diagnose the results of each training, and so on. For example, I believe the GPT-3 team only did one training of their largest model; they worked hard to figure out good hyperparameter settings by extrapolating from smaller studies.
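For concreteness, here is one hedged sketch of the extrapolate-from-smaller-studies idea. It is not the GPT-3 team’s actual procedure, and the model sizes and learning rates are made up; the point is just fitting a trend on cheap small-scale runs and reading off a value for the one big run you can afford.

```python
# Hypothetical illustration only: extrapolating a hyperparameter from small-scale studies.
import numpy as np

# Pretend results of small-scale tuning: model size -> best learning rate found by search.
model_sizes = np.array([1e6, 1e7, 1e8, 1e9])         # parameter counts (made up)
best_lrs    = np.array([6e-3, 3e-3, 1.5e-3, 8e-4])   # tuned at each size (made up)

# Fit a power law, lr ~ a * size^b, by linear regression in log-log space.
b, log_a = np.polyfit(np.log(model_sizes), np.log(best_lrs), 1)

def predicted_lr(size):
    return np.exp(log_a) * size ** b

# Read off a setting for the one full-scale training you can afford to run.
print(f"predicted learning rate at 175B parameters: {predicted_lr(175e9):.1e}")
```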
...Whereas it seems that the report is doing a different analogy:
(A) Brain: One (billion-subjective-second) run ↔ ML: One run during training (one play of an Atari game etc.)
(B) Brain: Adjustable parameters on the genome ↔ ML: Learnable parameters in the model
(C) Brain: Many (billion-subjective-second) runs ↔ ML: One model-training session
I think that analogy is much worse than the one I proposed. You’re mixing short tests with long-calculations-that-involve-a-ton-of-learning, mixing human tweaking of understandable parameters with gradient descent, and so on.
To be clear, I don’t think my proposed analogy is perfect, because I think that brain algorithms are rather different than today’s ML algorithms. But I think it’s a lot better than what’s there now, and maybe it’s the best you can do without getting into highly speculative and controversial inside-view-about-brain-algorithms stuff.
I could be wrong or confused :-)
I like this comment, and more generally I feel like there’s more information to be gained from clarifying the analogies to evolution, and from getting clearer on when it’s possible for researchers to tune hyperparameters with shortcuts vs. cases where they’d have to “boil the oceans.”
Do you have a rough sense on how using your analogy would affect the timeline estimates?
Using Steve’s analogy would make for much shorter timeline estimates. Steve guesses that 10-100 runs of online learning are needed, i.e. 10-100 iterations to find the right hyperparameters before you get a training run that produces something actually smart like a human. That is only 1-2 orders of magnitude more compute than the human-brain-human-lifetime anchor, which is the nearest anchor (and to which Ajeya assigns only 5% credence!). Eyeballing the charts, it looks like you’d end up with something like 50% probability by 2035, holding fixed all of Ajeya’s other assumptions.
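To spell out the arithmetic behind that (my own back-of-the-envelope; the brain-compute figure is an assumption, and the report’s own anchor values differ in detail): one billion-subjective-second run at roughly 1e15 FLOP per subjective second is about 1e24 FLOP, and 10-100 such runs adds 1-2 orders of magnitude on top of that.

```python
# Back-of-the-envelope only; the 1e15 FLOP/s figure is an assumption, not from the comment above.
import math

brain_flop_per_s = 1e15                         # assumed compute per subjective second
seconds_per_run  = 1e9                          # "a billion subjective seconds" per run
one_run_flop     = brain_flop_per_s * seconds_per_run   # ~1e24 FLOP for one lifetime-like run

for n_runs in (10, 100):                        # Steve's guessed range of tuning iterations
    total = n_runs * one_run_flop
    print(f"{n_runs} runs -> ~{total:.0e} FLOP (+{math.log10(n_runs):.0f} OOM over one run)")
```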
Thanks! I don’t know off the top of my head, sorry.
This is a separate point from yours, but one thing I’m skeptical about is the following:
The Genome Anchor takes the information in the human genome and looks at it as a kind of compression of brain architectures, right? But that doesn’t seem right to me. By itself, a genome is quite useless. If we had the DNA of a small dinosaur today, we probably couldn’t just use ostriches as surrogate mothers. The way the genome encodes information is tightly linked to the rest of the organism’s biology, particularly its cellular machinery and the hormonal environment of the womb. The genome is just one half of the encoding, and if we don’t get the rest right, it all gets scrambled.
Edit: OK here’s an argument why my point is flawed: Once you have the right type of womb, all the variation in a species’ gene pool can be expressed phenotypically out of just one womb prototype. This suggests that the vast majority of the information is just in the genome.
When I imagine brain architecture information I imagine “nerve fiber tract #17 should connect region 182 neuron type F to region 629 neuron type N” and when I imagine brain semantic information I imagine “neuron #526853 should connect to dendrite branch 245 of neuron #674208”. I don’t immediately see how either of these types of things could come from the womb (it’s not like there’s an Ethernet cable in there), except that the brain can learn in the womb environment just like it can learn in every other environment.
Once you have the right type of womb, all the variation in a species’ gene pool can be expressed phenotypically out of just one womb prototype.
Not sure that argument proves much; it could also be that the vast majority of the information is the same for all humans.
We do have cases of very preterm infants turning out neurologically normal. I guess that only proves that no womb magic happens in the last 10-15 weeks of gestation.