I mentioned “non-uniform neural architecture and hyperparameters”. I’m inclined to put different layer thicknesses (including agranularity) in the category of “non-uniform hyperparameters”.
If evolution could initialize certain parts of the cortex so that they are faster “up and running” why wouldn’t it?
If you buy the “locally-random pattern separation” story (Section 2.5.4), that would make it impossible for evolution to initialize the adjustable parameters in a non-locally-random way.
in terms of computation theory, learning from scratch is computationally intractable. Strong, informative priors over hypothesis space might just be necessary to learn anything worthwhile at all.
I’m very confused by this. I have coded up a ConvNet with random initialization. It was computationally tractable; in fact, it ran on my laptop!
I guess maybe what you’re claiming is: we can’t have all three of {learning from scratch, general intelligence, computational tractability}. If so, well, that’s a possible thing to believe, although I happen not to believe it. My question would be: why do you believe it? “Learning-from-scratch algorithms” form an astronomically large class of algorithms, of which only an infinitesimal fraction have ever even been conceived of by humans. I think it’s difficult to make blanket statements about the whole category.
I don’t see the relevance of Solomonoff Induction here. “Generally intelligent” is a much lower bar than “just as intelligent as a Solomonoff Inductor”, right?
I’m also confused about why you think “strong, informative priors over hypothesis space” are not compatible with learning-from-scratch algorithms. The famous example everyone talks about is how ConvNets disproportionately search for patterns that are local (i.e. involve neighboring pixels) and translation-invariant.
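To be concrete about what I mean, here is a toy PyTorch sketch (just an illustration I made up for this comment, not anything from the post): both models below start from random weights, i.e. they “learn from scratch”, but the ConvNet’s architecture already bakes in the locality-and-weight-sharing prior, which you can see directly in the parameter count.

```python
import torch
import torch.nn as nn

# Two randomly initialized models for 32x32 RGB inputs and 10 classes.
# Both "learn from scratch" (random weights); only the architecture differs.

fully_connected = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 256),  # every pixel connects to every hidden unit
    nn.ReLU(),
    nn.Linear(256, 10),
)

convnet = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # local 3x3 patches, weights shared across positions
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # global average pooling -> (approximate) translation invariance
    nn.Flatten(),
    nn.Linear(16, 10),
)

def n_params(model):
    return sum(p.numel() for p in model.parameters())

print("fully connected:", n_params(fully_connected))  # ~790k parameters
print("convnet:        ", n_params(convnet))          # ~600 parameters

# And both are computationally tractable on a laptop:
x = torch.randn(8, 3, 32, 32)  # a random batch of fake "images"
print(fully_connected(x).shape, convnet(x).shape)  # torch.Size([8, 10]) twice
```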
What does “randomly initialized” even mean in the brain? At what point is the brain initialized?
Here’s an operationalization. Suppose someday we write computer code that can do the exact same useful computational things that the neocortex (etc.) does, for the exact same reason. My question is: Might that code look like a learning-from-scratch algorithm?
Here’s an operationalization. Suppose someday we write computer code that can do the exact same useful computational things that the neocortex (etc.) does, for the exact same reason. My question is: Might that code look like a learning-from-scratch algorithm?
Hmm, I see. If this is the crux, then I’ll put all the remaining nitpicking at the end of my comment and just say: I think I’m on board with your argument. Yes, it seems conceivable to me that a learning-from-scratch program ends up in a (functionally) very similar state to the brain. The trajectory of how the program ends up there over training probably looks different (and might take a bit longer if it doesn’t use the shortcuts that the brain got from evolution), but I don’t think the stuff that evolution put in the cortex is strictly necessary.
A caveat: I’m not sure how much weight the similarity between the program and the brain can support before it breaks down. I’d strongly suspect that certain aspects of the cortex are not logically implied by the statistics of the environment, but rather represent idiosyncratic quirks that were adaptive at some point during evolution. Those idiosyncratic quirks won’t be in the learning-from-scratch program. But perhaps (probably?) they are also not relevant in the big scheme of things.
I’m inclined to put different layer thicknesses (including agranularity) in the category of “non-uniform hyperparameters”.
Fair! Most people in computational neuroscience are also very happy to ignore those differences, and so far nothing terribly bad has happened.
If you buy the “locally-random pattern separation” story (Section 2.5.4), that would make it impossible for evolution to initialize the adjustable parameters in a non-locally-random way.
You point out yourself that some areas (e.g. the motor cortex) are agranular, so that argument doesn’t work there. But ignoring that, and conceding the cerebellum and the Drosophila mushroom body to you (not my area of expertise), I’m pretty doubtful about postulating “locally-random pattern separation” in the cortex. I’m interpreting your thesis to cash out as “Given a handful of granule cells from layer 4, the connectivity with pyramidal neurons in layer 2/3 is (initially) effectively random, and therefore layer 2/3 neurons need to learn (from scratch) how to interpret the signal from layer 4”. Is that an okay summary?
Because if so, then I think this fails on three counts:
One characteristic feature of the cortex is the presence of cortical maps. They exist in basically all sensory and motor cortices, and they have a very regular structure that is present in animal species separated by as much as 64 million years of evolution. These maps imply that if you pick a handful of granule cells from layer 4 that are located nearby, their functional properties will be somewhat similar! Therefore, even if connectivity between L4 and L2/3 is locally random it doesn’t really matter since the input is somewhat similar in any case. Evolution could “use” that fact to pre-structure the circuit in L2/3.
Connectivity between L4 and L2/3 is not random. Projections from layer 4 are specific to different portions of the postsynaptic dendrite, and nearby synapses on mature and developing dendrites tend to share similar activation patterns. Perhaps you want to argue that this non-randomness only emerges through learning and the initial configuration is random? That’s a possibility, but …
… when you record activity from neurons in the cortex of an animal that had zero visual experience prior to the experiment (lid-suture), they are still orientation-selective! And so is the topographic arrangement of retinal inputs and the segregation of eye-specific inputs. At the point of eye-opening, the animals are already pretty much able to navigate their environment.
Obviously, there are still a lot of things that need to be refined and set up during later development, but defects in these early stages of network initialization are pretty bad (a lot of neurodevelopmental disorders manifest as “wiring defects” that start in early development).
I’m very confused by this. I have coded up a ConvNet with random initialization. It was computationally tractable; in fact, it ran on my laptop!
Okay, my claim there came out a lot stronger than I wanted and I concede a lot of what you say. Learning from scratch is probably not computationally intractable in the technical sense. I guess what I wanted to argue is that it appears practically infeasible to learn everything from scratch. (There is a lot of “everything” and not a lot of time to learn it. Any head start might be strictly necessary and not just a nice-to-have.)
(As a side point: your choice of a convnet as the example is interesting. People came up with convnets because fully-connected, randomly initialized networks were not great at image classification and we needed some inductive bias in the form of a locality constraint to learn in a reasonable time. That’s the point I wanted to make.)
I guess maybe what you’re claiming is: we can’t have all three of {learning from scratch, general intelligence, computational tractability}.
Interesting, I haven’t thought about it like this before. I do think it could be possible to have all three—but then it’s not the brain anymore. As far as I can tell, evolutionary pressures make complete learning from scratch infeasible.
Thanks for your interesting comments!
People came up with convnets because fully-connected, randomly initialized networks were not great at image classification and we needed some inductive bias in the form of a locality constraint to learn in a reasonable time. That’s the point I wanted to make.
I’m pretty confused here. To me, that doesn’t seem to support your point, which suggests that one of us is confused, or else I don’t understand your point.
Specifically: If I switch from a fully-connected DNN to a ConvNet, I’m switching from one learning-from-scratch algorithm to a different learning-from-scratch algorithm.
I feel like your perspective is that {inductive biases, non-learning-from-scratch} are a pair that go inexorably together, and you are strongly in favor of both, and I am strongly opposed to both. But that’s not right: they don’t inexorably go together. The ConvNet example proves it.
I am in favor of learning-from-scratch, and I am also in favor of specific designed inductive biases, and I don’t think those two things are in opposition to each other.
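Again, to be concrete (a toy sketch of my own, with made-up data, just to pin down the claim): the training procedure below is literally the same function for both architectures. Swapping the MLP for the ConvNet changes the inductive bias, but nothing about the procedure stops being “learning from scratch”.

```python
import torch
import torch.nn as nn

def train_from_scratch(model, data, targets, steps=100, lr=1e-2):
    """Plain SGD from random initialization. The learning procedure is the
    same no matter which architecture (i.e. which inductive bias) we hand it."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(data), targets)
        loss.backward()
        opt.step()
    return loss.item()

# Made-up toy data: 64 random "images", 10 classes.
data = torch.randn(64, 3, 32, 32)
targets = torch.randint(0, 10, (64,))

mlp = nn.Sequential(
    nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 10)
)
cnn = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)

# Two different inductive biases, one and the same learning-from-scratch procedure:
print("MLP final loss:", train_from_scratch(mlp, data, targets))
print("CNN final loss:", train_from_scratch(cnn, data, targets))
```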
Yes, it seems conceivable to me that a learning-from-scratch program ends up in a (functionally) very similar state to the brain.
I think you’re misunderstanding me. Random chunks of matter do not learn language, but the neocortex does. There’s a reason for that—aspects of the neocortex are designed by evolution to do certain computations that result in the useful functionality of learning language (as an example). There is a reason that these particular computations, unlike the computations performed by random chunks of matter, are able to learn language. And this reason can be described in purely computational terms—“such-and-such process performs a kind of search over this particular space, and meanwhile this other process breaks down the syntactic tree using such-and-such algorithm…”, I dunno, whatever. The point is, this kind of explanation does not talk about subplates and synapses, it talks about principles of algorithms and computations.
Whatever that explanation is, it’s a thing that we can turn into a design spec for our own algorithms, which, powered by the same engineering principles, will do the same computations, with the same results.
In particular, our code will be just as data-efficient as the neocortex is, and it will make the same types of mistakes in the same types of situations as the neocortex does, etc. etc.
when you record activity from neurons in the cortex of an animal that had zero visual experience prior to the experiment (lid-suture), they are still orientation-selective
is that true even if there haven’t been any retinal waves?
Yeah, the feeling’s mutual 😅 But the discussion is also very rewarding for me, thank you for engaging!
I am in favor of learning-from-scratch, and I am also in favor of specific designed inductive biases, and I don’t think those two things are in opposition to each other.
A couple of thoughts:
Yes, I agree that the inductive bias (/genetically hardcoded information) can live in different components: the learning rule, the network architecture, or the initialization of the weights. So learning-from-scratch is logically compatible with inductive biases—we can just put all the inductive bias into the learning rule and the architecture and none in the weights.
But from the architecture and the learning rule, the hardcoded info can enter the weights very rapidly (e.g. the first step of the learning rule could be: set all the weights to the values appropriate for an adult brain. Or, more realistically, a fully-connected network can learn a ConvNet architecture simply by setting a lot of connections to zero). Therefore I don’t see what it buys you to assume that the weights are free of inductive bias. (The sketch after these points tries to make this concrete.)
There might also be a case to be made that in the actual biological brain the weights are not initialized randomly. See e.g. this work on clonally related neurons.
Something that is not appreciated much outside of neuroscience: “Learning” in the brain is as much a structural process as it is a “changing weights” process. This is particularly true throughout development, but also into adulthood: activity-dependent learning rules not only adjust the weights of connections, they can also prune bad connections and add new ones. The brain simultaneously produces activity, which induces plasticity, which changes the circuit, which produces slightly different activity in turn.
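Here is the sketch I promised above (a toy construction of mine, not something from your post): a dense layer whose trainable weights start out random, but whose fixed connectivity mask, chosen before any data arrives, already enforces a local, ConvNet-like wiring on a 1-D input. Whether we file that mask under “architecture” or under “initialization” seems like bookkeeping.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """A dense layer with randomly initialized weights but a fixed, hand-designed
    connectivity mask. The mask carries the inductive bias (locality along a 1-D
    input); the trainable weights themselves start out random."""

    def __init__(self, n_in, n_out, bandwidth=3):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(n_out, n_in))  # random init
        # Fixed binary mask: output unit j only sees inputs near position j.
        mask = torch.zeros(n_out, n_in)
        for j in range(n_out):
            center = j * n_in // n_out
            lo, hi = max(0, center - bandwidth), min(n_in, center + bandwidth + 1)
            mask[j, lo:hi] = 1.0
        self.register_buffer("mask", mask)  # not trainable: "architecture", not "weights"

    def forward(self, x):
        # Masked-out connections contribute nothing and receive zero gradient,
        # exactly as if some learning rule had pruned them on its first step.
        return x @ (self.weight * self.mask).t()

layer = MaskedLinear(n_in=100, n_out=50)
x = torch.randn(4, 100)
print(layer(x).shape)  # torch.Size([4, 50])
```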
The point is, this kind of explanation does not talk about subplates and synapses, it talks about principles of algorithms and computations.
That sounds a lot more like cognitive science than neuroscience! This is completely fine (I did my undergrad in CogSci), but it requires a different set of arguments from the ones you are providing in your post, I think. If you want to make a CogSci case for learning from scratch, then your argument has to be a lot more constructive (i.e. literally walk us through the steps of how your proposed system can learn all/a lot of what humans can learn). Either you take a look at what is there in the brain (subplate, synapses, …), describe how these things interact, and (correctly) infer that it’s sufficient to produce a mind (this is the neuroscience strategy); or you propose an abstract system, demonstrate that it can do the same thing as the mind, and then demonstrate that the components of the abstract system can be identified with the biological brain (this is the CogSci strategy). I think you’re skipping step two of the CogSci strategy.
Whatever that explanation is, it’s a thing that we can turn into a design spec for our own algorithms, which, powered by the same engineering principles, will do the same computations, with the same results.
I’m on board with that. I anticipate that the design spec will contain (the equivalent of) a ton of hardcoded genetic stuff also for the “learning subsystem”/cortex. From a CogSci perspective, I’m willing to assume that this genetic stuff could be in the learning rule and the architecture, not in the initial weights. From a neuroscience perspective, I’m not convinced that’s the case.
is that true even if there haven’t been any retinal waves?
Blocking retinal waves messes up the cortex pretty substantially (same as if the animal were born without eyes). There is the beta-2 knockout mouse, which has retinal waves but with weaker spatiotemporal correlations. As a consequence, beta-2 mice fail to track moving gratings and have disrupted receptive fields.
Thanks!
Not even him! Jeff Hawkins: “Mountcastle’s proposal that there is a common cortical algorithm doesn’t mean there are no variations. He knew that. The issue is how much is common in all cortical regions, and how much is different. The evidence suggests that there is a huge amount of commonality.”