To clarify, the main difficulty I see here is that this isn’t actually like training n networks of size N/n, because you’re still using the original loss function.
Your optimiser doesn’t get to see how well each module is performing individually, only their aggregate performance. So if module three is doing great, but module five is doing abysmally, and the answer depends on both being right, your loss is really bad. So the optimiser is going to happily modify three away from the optimum it doesn’t know it’s in.
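The worry can be made concrete with a toy example (entirely my own construction, not anything from the discussion above): reduce two "modules" to single scalar parameters whose product is the network's answer.

```python
# Two "modules" reduced to scalar parameters; the network's answer is
# their product, and the true answer requires p3 = 2 and p5 = 3.
target = 6.0
p3, p5 = 2.0, 0.1  # module three is spot on, module five is abysmal

def loss(p3, p5):
    return (p3 * p5 - target) ** 2

# Gradient of the aggregate loss with respect to module three's parameter:
grad_p3 = 2 * (p3 * p5 - target) * p5

# The aggregate loss is large and the gradient on p3 is nonzero, so a
# gradient step drags module three away from the value it should keep.
p3_after_step = p3 - 0.1 * grad_p3  # moves from 2.0 to 2.116
```

The optimiser has no way to tell that p3 was already perfect; all it sees is a bad aggregate loss with a nonzero gradient everywhere.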
Nevertheless, I think there could be something to the basic intuition that fine-tuning just gets more and more difficult for the optimiser as you increase the parameter count, and with it the number of interaction terms, until the only way to find anything good any more is to just set a bunch of those interactions to zero.
This would predict that in 2005-style NNs with tiny parameter counts, you would have no modularity. In real biology, with far more interacting parts, you would have modularity. And in modern deep learning nets with billions of parameters, you would also have modularity. This matches what we observe. Really neatly and simply too.
It’s also dead easy to test. Just make a CNN or something and see how modularity scales with parameter count. This is now definitely on our to-do list.
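For the record, the measurement step of that experiment might look something like this. The metric here is an assumption of mine (just the fraction of weight mass a fixed neuron partition keeps within groups), not whatever measure we'd actually settle on:

```python
import numpy as np

def partition_modularity(W, in_groups, out_groups):
    """Score how modular a layer is under a given neuron partition.

    W: (n_out, n_in) weight matrix; *_groups: 0/1 label per neuron.
    Returns the fraction of total |weight| mass that stays within groups.
    """
    A = np.abs(W)
    within = A[np.ix_(out_groups == 0, in_groups == 0)].sum() \
           + A[np.ix_(out_groups == 1, in_groups == 1)].sum()
    return within / A.sum()

# A perfectly modular toy layer: block-diagonal weights.
W_mod = np.block([[np.ones((2, 2)), np.zeros((2, 2))],
                  [np.zeros((2, 2)), np.ones((2, 2))]])
labels = np.array([0, 0, 1, 1])
score = partition_modularity(W_mod, labels, labels)  # 1.0 for this layer
```

The actual experiment would then train CNNs across a range of parameter counts and plot something like this score (maximised over partitions) against network size.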
Reasoning: Training independent parts to each perform some specific sub-calculation should be easier than training the whole system at once.
Since I’ve not been involved in this discussion for as long, I’ll probably miss some subtlety here, but my immediate reaction is that “easier” might depend on your perspective. If you’re explicitly enforcing modularity in the architecture (e.g. see the “Direct selection for modularity” section of our other post) then I agree it would be a lot easier, but whether modular systems are selected for when they’re trained on factorisable tasks is kinda the whole question. Since sections of biological networks do sometimes evolve completely in isolation from each other (because they’re literally physically connected), it does seem plausible that something like this is happening, but it doesn’t really move us closer to a gears-level model for what’s causing modularity to be selected for in the first place. I imagine I’m misunderstanding something here though.
So if module three is doing great, but module five is doing abysmally, and the answer depends on both being right, your loss is really bad. So the optimiser is going to happily modify three away from the optimum it doesn’t know it’s in.
Maybe one way to get around this is that the loss function might not just be a function of the final outputs of each subnetwork combined; it might also reward bits of subcomputation. To take a deep learning example we’ve discussed: suppose you were training a CNN to calculate the sum of two MNIST digits, and you were hoping the CNN would develop a modular representation of the two digits plus an “adding function”. Maybe the network could also be rewarded for the subtask of recognising the individual digits?

It seems somewhat plausible to me that this kind of thing happens in biology; otherwise there would be too many evolutionary hurdles to jump before you get a minimum viable product. As an example, the eye is a highly complex and modular structure, but the very first eyes were basically just photoreceptors that detected areas of bright light (making it easier to navigate in the water, and hide from predators I think). So at first the loss function wasn’t so picky as to only tolerate perfect image reconstructions of the organism’s surroundings; instead it simply graded good brightness-detection, which I think could today be regarded as one of the “factorised tasks” of vision (although I’m not sure about this).
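A sketch of what that extra reward might look like in the MNIST-sum example (all hypothetical; the names and the weighting are mine):

```python
def combined_loss(sum_loss, digit_losses, aux_weight=0.5):
    """Grade the final sum AND each digit classification, not the sum alone.

    aux_weight trades off the final task against the sub-tasks.
    """
    return sum_loss + aux_weight * sum(digit_losses)

# Under a sum-only loss these two networks look identical (both get the
# sum wrong by the same amount), but the auxiliary terms separate them:
good_digits_bad_adder = combined_loss(sum_loss=2.0, digit_losses=[0.0, 0.0])
bad_digits = combined_loss(sum_loss=2.0, digit_losses=[1.5, 1.5])
```

The auxiliary terms mean the digit-recognition “modules” keep getting credit (and gradient protection) even while the adding function is still wrong, which is the evolutionary-hurdles point in miniature.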
Thanks a lot again, Simon!