I’m not seeing the merit of the genome anchor. I see how it would make sense if humans didn’t learn anything over the course of their lifetime. Then all the inference-time algorithmic complexity would come from the genome, and you would need your ML process to search over a space of models that can express that complexity. However, needless to say, humans do learn things over the course of their lifetime! I feel even more strongly about that than most, but I imagine we can all agree that the inference-time algorithmic complexity of an adult brain is not limited by what’s in the genome, but rather also incorporates information from self-supervised learning etc.
The opposite perspective would say: the analogy isn’t between the ML trained model and the genome, but rather between the ML learning algorithm and the genome on one level, and between the ML trained model and the synapses on the other. So, something like ML parameter count = synapse count, while the genome size would correspond to “how complicated is the architecture and learning algorithm?”—i.e., add up the algorithmic complexity of backprop plus dropout regularization plus BatchNorm plus data augmentation plus Xavier initialization, etc. Or something like that.
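For a rough sense of the magnitudes this reading compares, here is a back-of-envelope sketch; all of the figures below are my own order-of-magnitude guesses, not numbers from the report:

```python
# Rough numbers for the two sides of this reading of the analogy. These are my own
# order-of-magnitude figures, not anything from the report.

genome_bytes  = 3.2e9 * 2 / 8   # ~3.2e9 base pairs at ~2 bits each ≈ 8e8 bytes
synapse_count = 1e14            # commonly cited ~1e14-1e15 synapses in an adult brain

# On this reading, the genome-sized quantity in ML is the description length of the
# architecture + learning algorithm (backprop, dropout, BatchNorm, augmentation,
# Xavier init, ...), while the synapse-sized quantity is the trained parameter count.
recipe_bytes = 5e4              # generous guess: tens of KB of training code/config
param_count  = 1e11             # e.g. a hypothetical hundred-billion-parameter model

print(f"biology: genome ≈ {genome_bytes:.0e} bytes vs. ~{synapse_count:.0e} synapses")
print(f"ML:      recipe ≈ {recipe_bytes:.0e} bytes vs. ~{param_count:.0e} parameters")
# In both columns, the learned/grown object has vastly more raw capacity than the
# recipe that produced it.
```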
I think the truth is somewhere in between, but a lot closer to the synapse-anchor side (which ignores instincts) than the genome-anchor side (which ignores learning)...
Sorry if I’m misunderstanding or missing something, or confused.
UPDATE: Or are we supposed to imagine an RNN wherein the genomic information corresponds to the weights, and the synapse information corresponds to the hidden-state activations? If so, I don’t think you can design an RNN (of the type typically used today) where the hidden-state activations have many orders of magnitude more information content than the weights. Usually there are more weights than hidden-state activations, right?
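To spell out that last intuition, here is a quick count of weights vs. hidden-state activations for a vanilla (Elman-style) RNN; the layer sizes are arbitrary illustrative choices:

```python
# Sanity check: weights vs. hidden-state activations in a plain (Elman-style) RNN.

def rnn_counts(input_size: int, hidden_size: int):
    """Return (number of weights, number of hidden-state activations)."""
    # W_xh (input->hidden), W_hh (hidden->hidden), plus one bias per hidden unit;
    # the output layer is ignored, which only strengthens the conclusion.
    n_weights = input_size * hidden_size + hidden_size * hidden_size + hidden_size
    n_activations = hidden_size
    return n_weights, n_activations

for h in (128, 1024, 8192):
    w, a = rnn_counts(input_size=256, hidden_size=h)
    print(f"hidden_size={h}: weights={w:,}  activations={a:,}  ratio≈{w / a:,.0f}x")

# Weights grow roughly like hidden_size**2 while the hidden state grows like
# hidden_size, so for standard RNNs the weights always dominate -- the opposite of
# what the "genome = weights, synapses = activations" reading would need.
```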
UPDATE 2: See my reply to this comment.
Let me try again. Maybe this will be clearer.
The paradigm of the brain is online learning. There are a “small” number of adjustable parameters governing how the process is set up, and then each run is long—a billion subjective seconds. During the run, a “large” number of adjustable parameters get adjusted. Almost all the information content comes within a single run.
The paradigm of today’s popular ML approaches is train-then-infer. There are a “large” number of adjustable parameters, which are adjusted over the course of an extremely large number of extremely short runs. Almost all the information content comes from the training process, not within the run. Meanwhile, sometimes people do multiple model-training runs with different hyperparameters—hyperparameters are a “small” number of adjustable parameters that sit outside the gradient-descent training loop.
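For concreteness, here is a toy sketch of that contrast; this is entirely schematic code of my own, with arbitrary sizes and update rules, just to show where the information content accumulates in each paradigm:

```python
import random

def brain_style_online_learning(plasticity: float = 0.01, run_length: int = 10_000):
    """One long run; the 'large' state (the synapses) is written during the run itself."""
    synapses = [0.0] * 1_000                    # 'large' number of adjustable parameters
    for _ in range(run_length):                 # stand-in for ~1e9 subjective seconds
        experience = random.gauss(0.0, 1.0)
        i = random.randrange(len(synapses))
        synapses[i] += plasticity * (experience - synapses[i])
    return synapses                             # nearly all the bits are acquired in-run

def ml_style_training(learning_rate: float = 0.01, n_episodes: int = 10_000):
    """Very many very short runs; the weights are adjusted across runs, then frozen."""
    weights = [0.0] * 1_000
    for _ in range(n_episodes):                 # each 'run' here is one tiny episode
        target = random.gauss(0.0, 1.0)
        weights = [w + learning_rate * (target - w) for w in weights]
    return weights

def ml_style_inference(weights, x: float) -> float:
    return sum(w * x for w in weights)          # frozen weights: no new information added
```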
I think the appropriate analogy is:
(A) Brain: One (billion-subjective-second) run ↔ ML: One gradient-descent model training
(B) Brain: Adjustable parameters on the genome ↔ ML: Hyperparameters
(C) Brain: Settings of synapses (or potential synapses) in a particular adult ↔ ML: Parameter settings of a fully-trained model
This seems to work reasonably well all around: (A) takes a long time and involves a lot of information content in the developed “intelligence”, (B) is a handful of (perhaps human-interpretable) parameters, (C) is the final “intelligence” that you wind up wanting to deploy.
So again I would analogize one run of the online-learning paradigm to one training of today’s popular ML approaches. Then I would try to guess how many runs of online learning you need. I would guess 10-100, not based on anything in particular; you could get a better number by looking at the extent to which people need to play with hyperparameters in their ML trainings, which is “not much, if it’s very important not to”.
Sure, you can do a boil-the-oceans automated hyperparameter search, but in the biggest projects, where there is no compute to spare, that isn’t an option. Instead, you sit and think about the hyperparameters, you do smaller-scale studies, you try to carefully diagnose the results of each training, and so on. For example, I believe the GPT-3 team trained their largest model only once; they worked hard to figure out good hyperparameter settings by extrapolating from smaller studies.
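To illustrate the kind of extrapolation I mean, here is a generic sketch; this is not OpenAI’s actual procedure, and the model sizes and learning rates below are invented for illustration:

```python
# Hedged sketch of "extrapolate a hyperparameter from smaller studies."
import numpy as np

# Suppose small-scale sweeps found these best learning rates at these model sizes
# (all numbers hypothetical):
model_sizes = np.array([1e7, 1e8, 1e9, 1e10])     # parameters
best_lrs    = np.array([1e-3, 6e-4, 3e-4, 2e-4])  # tuned on the small models

# Fit a power law  lr ≈ a * size**b  via linear regression in log-log space...
b, log_a = np.polyfit(np.log(model_sizes), np.log(best_lrs), deg=1)

# ...then extrapolate to the one big model you can only afford to train once.
big_model_size = 2e11
predicted_lr = np.exp(log_a) * big_model_size ** b
print(f"predicted learning rate for a {big_model_size:.0e}-parameter model: {predicted_lr:.1e}")
```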
...Whereas it seems that the report is doing a different analogy:
(A) Brain: One (billion-subjective-second) run ↔ ML: One run during training (one play of an Atari game etc.)
(B) Brain: Adjustable parameters on the genome ↔ ML: Learnable parameters in the model
(C) Brain: Many (billion-subjective-second) runs ↔ ML: One model-training session
I think that analogy is much worse than the one I proposed. You’re mixing short tests with long calculations that involve a ton of learning, and you’re mixing human tweaking of understandable parameters with gradient descent, etc.
To be clear, I don’t think my proposed analogy is perfect, because I think brain algorithms are rather different from today’s ML algorithms. But I think it’s a lot better than what’s there now, and maybe it’s the best you can do without getting into highly speculative and controversial inside-view-about-brain-algorithms stuff.
I could be wrong or confused :-)
I like this comment, and more generally I feel like there’s more information to be gained from clarifying the analogies to evolution, and from pinning down when it’s possible for researchers to tune hyperparameters with shortcuts vs. when they’d have to “boil the oceans.”
Do you have a rough sense on how using your analogy would affect the timeline estimates?
Using Steve’s analogy would make for much shorter timeline estimates. Steve guesses that 10-100 runs of online learning are needed, i.e. 10-100 iterations to find the right hyperparameters before you get a training run that produces something actually smart like a human. That is only 1-2 orders of magnitude more compute than the human-brain-human-lifetime anchor, which is the nearest anchor (and which Ajeya assigns only 5% credence to!). Eyeballing the charts, it looks like you’d end up with something like 50% probability by 2035, holding fixed all of Ajeya’s other assumptions.
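For concreteness, a back-of-envelope version of that arithmetic; the brain-compute figure is a rough stand-in of mine, not a number taken from the report, and the run length is the billion subjective seconds mentioned above:

```python
# Rough arithmetic behind "1-2 orders of magnitude above the lifetime anchor."

BRAIN_FLOP_PER_S = 1e15   # ballpark brain-compute estimate (assumption, not from the report)
RUN_SECONDS = 1e9         # one billion-subjective-second run

lifetime_anchor_flop = BRAIN_FLOP_PER_S * RUN_SECONDS   # ~1e24 FLOP for a single run

for n_runs in (10, 100):  # Steve's guessed number of hyperparameter iterations
    print(f"{n_runs} runs -> ~{n_runs * lifetime_anchor_flop:.0e} FLOP total")
```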
Thanks! I don’t know off the top of my head, sorry.
This is a separate point from yours, but one thing I’m skeptical about is the following:
The Genome Anchor takes the information in the human genome and treats it as a kind of compressed encoding of brain architectures, right? That doesn’t seem right to me. By itself, a genome is quite useless. If we had the DNA of a small dinosaur today, we probably couldn’t just use ostriches as surrogate mothers. The way the genome encodes information is tightly linked to the rest of an organism’s biology, particularly its cellular machinery and the hormonal environment of the womb. The genome is just one half of the encoding, and if we don’t get the other half right, it all gets scrambled.
Edit: OK, here’s an argument for why my point is flawed: once you have the right type of womb, all the variation in a species’ gene pool can be expressed phenotypically out of just one womb prototype. This suggests that the vast majority of the information is in the genome itself.
When I imagine brain architecture information I imagine “nerve fiber tract #17 should connect region 182 neuron type F to region 629 neuron type N” and when I imagine brain semantic information I imagine “neuron #526853 should connect to dendrite branch 245 of neuron #674208”. I don’t immediately see how either of these types of things could come from the womb (it’s not like there’s an Ethernet cable in there), except that the brain can learn in the womb environment just like it can learn in every other environment.
Once you have the right type of womb, all the variation in a species’ gene pool can be expressed phenotypically out of just one womb prototype.
Not sure that argument proves much; it could also be that the vast majority of the information is the same for all humans.
We do have cases of very preterm infants turning out neurologically normal. I guess that only proves that no womb magic happens in the last 10-15 weeks of gestation.
Potentially worth noting that if you add the lifetime anchor to the genome anchor, you most likely get ~the genome anchor.