Well, isn’t having multiple modules a precondition to something being modular? That seems like what’s happening in your example: it has only one module, so it doesn’t even make sense to apply John’s criterion.
I think Steven’s point is that if your explanation for why modularity leads to broadness is that the parameters inside a module can take any configuration, conditioned on the module’s output staying the same, then you’re missing at least one step: showing that a network consisting of two modules with n/2 parameters each has more freedom in those parameters than a network consisting of one module (the entire network itself) with n parameters. Otherwise you’re not actually showing how this favours modularity over non-modularity.
Which may well be non-obvious. Intuitively, you might think that the two-module scenario puts more constraints on the parameters than the one-module scenario, since there are two places in the network where you’re demanding particular behaviour rather than one.
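To make that counting worry explicit (just a back-of-the-envelope sketch, with made-up constraint counts k, k1, k2, not anything from the original argument): if pinning down the whole network’s input-output behaviour imposes k independent scalar constraints on its n parameters, the solution set generically has dimension n − k. Pinning down two modules separately, with k1 and k2 constraints on n/2 parameters each, instead gives

$$\Big(\tfrac{n}{2} - k_1\Big) + \Big(\tfrac{n}{2} - k_2\Big) = n - (k_1 + k_2),$$

and since fixing an internal interface on top of the final output would typically mean $k_1 + k_2 \ge k$, naive dimension counting gives the modular network no more freedom, and possibly less. An argument that modularity buys broadness has to say what breaks this naive count.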
My own guiding intuition for why modularity seems to cause broadness takes the more roundabout path of “modularity seems connected to abstraction, abstraction seems connected to generality, generality seems connected to broadness in parameter space”.
I think the hand-wavy math we currently have also points more towards this connection. It seems to say that broadness is connected to dropping as much information about the input as you can while still getting the right answer, which sure looks suggestively like a statement about avoiding fine-tuning. And modules are things that only pass on small summaries of what goes on inside them, rather than propagating all the information they contain.
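As a toy illustration of that “dropping information” picture (a minimal made-up example, not the math being referenced): in the numpy sketch below, hidden unit 0’s outgoing weight is zero, so the network throws away everything that unit computes about the input, and the parameters feeding it can be moved freely without changing the output, i.e. a flat direction in parameter space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-hidden-layer network: y = w2 . relu(W1 x).
# Hidden unit 0's outgoing weight is zero, so the network discards
# everything that unit computes about the input.
W1 = rng.normal(size=(3, 4))          # 3 hidden units, 4 inputs
w2 = np.array([0.0, 1.0, -0.5])       # unit 0 is ignored downstream

def forward(W1, w2, X):
    return np.maximum(X @ W1.T, 0.0) @ w2

X = rng.normal(size=(100, 4))
y_ref = forward(W1, w2, X)

# Move the parameters feeding the ignored unit (row 0 of W1) arbitrarily far:
W1_moved = W1.copy()
W1_moved[0] += rng.normal(scale=10.0, size=4)

y_moved = forward(W1_moved, w2, X)
print(np.max(np.abs(y_moved - y_ref)))  # prints 0.0: a flat direction in parameter space
```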
Which may well be non-obvious. Intuitively, you might think that the two-module scenario puts more constraints on the parameters than the one-module scenario, since there are two places in the network where you’re demanding particular behaviour rather than one.
Don’t more constraints mean less freedom, and therefore less broadness in parameter space?
(Sorry if that’s a stupid question, I don’t really understand the reasoning behind the whole connection yet.)
(And thanks, the last two paragraphs were helpful, though I didn’t look into the math!)
Yes, that was the point. At least at first blush, this line of argument looks like it’s showing the opposite of what it purports to, so maybe it isn’t that great of an explanation.
On a separate note, I think the math I referenced above can now be updated to say: broadness depends on the number of orthogonal features a network has, and on how large the norms of those features are. Both feature orthogonality and feature norm are defined via the L2 Hilbert space inner product, which you may know from quantum mechanics.
This neatly encapsulates, extends, and quantifies the “information loss” notion in Vivek’s linked post above. It also sounds a lot like it’s formalising intuitions about broadness being connected to “generality”, “simplicity”, and lack of “fine tuning”.
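To spell out one guess at the formal statement behind this (assuming the “features” are the per-parameter output sensitivities $g_i(x) = \partial f_\theta(x)/\partial \theta_i$ and a mean-squared-error loss; the actual write-up may set things up differently): with the $L^2$ inner product $\langle g, h \rangle = \mathbb{E}_x\!\left[g(x)\,h(x)\right]$ over the input distribution, the Hessian of $L(\theta) = \mathbb{E}_x\!\left[(f_\theta(x) - y(x))^2\right]$ at a zero-error minimum is

$$H_{ij} = 2\,\mathbb{E}_x\!\left[\frac{\partial f_\theta(x)}{\partial \theta_i}\,\frac{\partial f_\theta(x)}{\partial \theta_j}\right] = 2\,\langle g_i, g_j \rangle,$$

i.e. the Gram matrix of the features: its rank counts the linearly independent features, its trace is the sum of their squared norms, and flat (broad) directions are exactly the parameter combinations whose features cancel.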
It also makes me suspect that the orthogonal feature basis is the fundamentally correct way to think about computations in neural networks.
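If that reading is roughly right, it can be sanity-checked numerically on a tiny network. A rough numpy sketch under the same assumptions (per-parameter sensitivities as features, MSE loss, zero-error minimum), not the actual construction being referenced:

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny network f(x) = w2 . tanh(W1 x), fit to its own outputs so the MSE
# minimum has exactly zero error; there the Hessian should equal the
# feature Gram matrix (times 2 from the squared-error loss).
W1 = rng.normal(size=(3, 2))
w2 = rng.normal(size=3)
theta0 = np.concatenate([W1.ravel(), w2])
X = rng.normal(size=(20, 2))

def f(theta, X):
    W1 = theta[:6].reshape(3, 2)
    w2 = theta[6:]
    return np.tanh(X @ W1.T) @ w2

y = f(theta0, X)                      # targets = current outputs -> zero loss at theta0

def loss(theta):
    return np.mean((f(theta, X) - y) ** 2)

def output_jacobian(theta, eps=1e-5):
    # Finite-difference Jacobian of the outputs w.r.t. parameters: the "features".
    J = np.zeros((len(X), len(theta)))
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        J[:, i] = (f(theta + d, X) - f(theta - d, X)) / (2 * eps)
    return J

def loss_hessian(theta, eps=1e-4):
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            di = np.zeros(n); di[i] = eps
            dj = np.zeros(n); dj[j] = eps
            H[i, j] = (loss(theta + di + dj) - loss(theta + di - dj)
                       - loss(theta - di + dj) + loss(theta - di - dj)) / (4 * eps ** 2)
    return H

J = output_jacobian(theta0)
gram = 2 * J.T @ J / len(X)           # 2 * <g_i, g_j> under the empirical L2 inner product
H = loss_hessian(theta0)
print(np.max(np.abs(H - gram)))       # small: Hessian matches the feature Gram matrix
```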
Post on this incoming once I figure out how to explain it to people who haven’t used Hilbert space before.