Which may seem rather non-obvious. Intuitively, you might think that the two-module scenario puts more constraints on the parameters than the one-module scenario, since there are two places in the network where you're demanding particular behaviour rather than one.
Don't more constraints mean less freedom, and therefore less broadness in parameter space?
(Sorry if that’s a stupid question, I don’t really understand the reasoning behind the whole connection yet.)
(And thanks, the last two paragraphs were helpful, though I didn’t look into the math!)
Yes, that was the point. At least at first blush, this line of argument looks like it’s showing the opposite of what it purports to, so maybe it isn’t that great of an explanation.
On a separate note, I think the math I referenced above can now be updated to say: broadness depends on the number of orthogonal features a network has, and on how large the norms of those features are, where both feature orthogonality and feature norm are defined via the L2 Hilbert space inner product, which you may know from quantum mechanics.
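To spell out what I mean (my own notation here, and just a sketch of the setup rather than the full result): treat each feature as a function $f_i(x)$ of the network's input, with the inner product taken over the training distribution $\mathcal{D}$:

$$\langle f_i, f_j \rangle = \mathbb{E}_{x \sim \mathcal{D}}\big[f_i(x)\, f_j(x)\big], \qquad \|f_i\| = \sqrt{\langle f_i, f_i \rangle}.$$

Two features are orthogonal when $\langle f_i, f_j \rangle = 0$, and the rough claim is that the more mutually orthogonal features the network relies on, and the larger their norms, the less broad the solution basin.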
This neatly encapsulates, extends, and quantifies the “information loss” notion in Vivek’s linked post above. It also sounds a lot like it’s formalising intuitions about broadness being connected to “generality”, “simplicity”, and lack of “fine tuning”.
It also makes me suspect that the orthogonal feature basis is the fundamentally correct way to think about computations in neural networks.
Post on this incoming once I figure out how to explain it to people who haven’t used Hilbert space before.