Dropping some late answers here, though this isn’t my subfield, so forgive me if I get some things wrong.
Correct me if I’m wrong, but it struck me while reading this that you can think of a neural network as learning two things at once:
1. a classification of the input into 2^N different classes (where N is the total number of neurons), each of which gets a different function applied to it
2. those functions themselves
This is exactly what a spline is! It’s where the spline view of neural networks comes from (mentioned in Appendix C of the post). What you call “classes” the literature typically calls the “partition.” Also, while deep networks can theoretically have exponentially many elements in the partition (w.r.t. the number of neurons), in practice the number of regions they actually realize tends to be much closer to linear in the number of neurons.
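If it helps to see this concretely, here’s a minimal NumPy sketch (my own toy setup, not from the post: a random, untrained one-hidden-layer ReLU net) of the two points above. The binary activation pattern of the neurons is the region code (your “class”), the network restricted to one region is a single affine map, and the number of regions you actually encounter is far smaller than 2^N:

```python
# Minimal sketch: a random one-hidden-layer ReLU net viewed as a spline.
# The activation pattern (which neurons are "on") is the region code; within a
# region the network is one affine map; and the regions actually visited by
# random inputs number far fewer than 2^N.
import numpy as np

rng = np.random.default_rng(0)
d_in, n_hidden = 2, 16                       # N = 16 neurons, so at most 2^16 patterns
W1 = rng.normal(size=(n_hidden, d_in)); b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(1, n_hidden));    b2 = rng.normal(size=1)

def forward(x):
    pre = x @ W1.T + b1                      # pre-activations, shape (batch, n_hidden)
    mask = (pre > 0).astype(float)           # the activation pattern / region code
    return (mask * pre) @ W2.T + b2, mask

# 1) Count how many distinct regions 10k random inputs actually land in.
X = rng.normal(size=(10_000, d_in))
_, masks = forward(X)
patterns = {tuple(m) for m in masks.astype(int)}
print(f"distinct regions seen: {len(patterns)} (out of a nominal 2^{n_hidden} = {2**n_hidden})")

# 2) Within one region the network is a single affine map y = A x + c.
x0 = X[0]
y_net, m = forward(x0[None])
A = (W2 * m) @ W1                            # effective slope on this region
c = (W2 * m) @ b1 + b2                       # effective offset on this region
print(np.allclose(x0 @ A.T + c, y_net))      # True: the affine map reproduces the net
```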
Can the functions and classes be decoupled?
To my understanding, this is exactly what previous (non-ML) research on splines did, with things like free-knot splines. Unfortunately, that turns out to be computationally intractable. So much research focused instead on fixing the partition (say, to a uniform grid) and changing only the functions; a well-known example is the wavelet transform. But then you lose the flexibility to change the partition, which is incredibly important if some regions need higher resolution than others!
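A toy illustration of the fixed-partition approach (my own example, not from the post): fit a piecewise-linear spline whose knots are frozen on a uniform grid, so only the coefficients are learned. The grid spends its resolution evenly, so the error piles up exactly where the target would have needed a finer partition:

```python
# Toy fixed-partition fit: uniform knots, only the coefficients are free.
import numpy as np

def hat_basis(x, knots):
    """Piecewise-linear ("hat") basis functions on a fixed set of knots."""
    B = np.zeros((len(x), len(knots)))
    for j, k in enumerate(knots):
        if j > 0:                                   # rising edge from the previous knot
            left = knots[j - 1]
            idx = (x >= left) & (x <= k)
            B[idx, j] = (x[idx] - left) / (k - left)
        if j < len(knots) - 1:                      # falling edge to the next knot
            right = knots[j + 1]
            idx = (x >= k) & (x <= right)
            B[idx, j] = (right - x[idx]) / (right - k)
    return B

target = lambda x: np.tanh(20 * (x - 0.7))          # flat almost everywhere, steep near 0.7
x = np.linspace(0, 1, 2000)
knots = np.linspace(0, 1, 12)                       # the partition is fixed: a uniform grid
B = hat_basis(x, knots)
coef, *_ = np.linalg.lstsq(B, target(x), rcond=None)
err = np.abs(B @ coef - target(x))
print(f"max error overall:            {err.max():.3f}")          # dominated by the steep part
print(f"max error on the flat region: {err[x < 0.5].max():.4f}")  # far smaller
```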
From this perspective, the coupling of the functions to the partition is exactly what makes neural networks good approximators in the first place! It lets you freely move the partition, as with free-knot splines, but in a way that’s still computationally tractable. Intuitively, neural networks can spend high resolution where it’s needed most, much like 3D meshes of video game characters have the most polygons in their faces.
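And a rough companion sketch (PyTorch, same steep toy target as above; the architecture and hyperparameters are arbitrary choices of mine): train a small 1-D ReLU net and read off the breakpoints of the learned spline, x = -b/w for each first-layer neuron. In my experience they tend to drift toward the steep part of the target, which is exactly the “free-knot, but tractable” behaviour described above:

```python
# Rough sketch: a tiny 1-D ReLU net learns where to put its breakpoints ("knots").
import torch

torch.manual_seed(0)
x = torch.linspace(0, 1, 2000).unsqueeze(1)
y = torch.tanh(20 * (x - 0.7))                   # same steep target as the grid example

net = torch.nn.Sequential(torch.nn.Linear(1, 12), torch.nn.ReLU(), torch.nn.Linear(12, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(3000):
    opt.zero_grad()
    loss = torch.mean((net(x) - y) ** 2)
    loss.backward()
    opt.step()

# Each first-layer neuron w*x + b contributes a breakpoint at x = -b/w.
w = net[0].weight.detach().squeeze()
b = net[0].bias.detach()
knots = -b / w
inside = knots[(knots > 0) & (knots < 1)]
print(sorted(round(k.item(), 2) for k in inside))  # breakpoints tend to cluster near x ≈ 0.7
```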
How much of the power of neural networks comes from their ability to learn to classify something into exponentially many different classes vs from the linear transformations that each class implements?
There are varying answers here, depending on what you mean by “power”: I’d say either the first or neither. If you mean “the ability to approximate efficiently,” then I would probably say that the partition matters more; assuming the partition is sufficiently fine, each linear transformation only performs a “first-order correction” on top of the mean value over its region of the partition.
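To spell out what I mean there (my own notation, loosely following the max-affine-spline literature: Ω is the partition, (A_ω, b_ω) the affine map on region ω, and x*_ω an arbitrary reference point in ω, e.g. its centroid):

```latex
f(x) = \sum_{\omega \in \Omega} \mathbf{1}[x \in \omega]\,\bigl(A_\omega x + b_\omega\bigr)
     = \sum_{\omega \in \Omega} \mathbf{1}[x \in \omega]\,\Bigl(
         \underbrace{A_\omega x^{*}_{\omega} + b_\omega}_{\text{value at } x^{*}_{\omega}}
       + \underbrace{A_\omega \,(x - x^{*}_{\omega})}_{\text{first-order correction}}
       \Bigr)
```

As the partition gets finer, x stays close to x*_ω within each region, so the first-order term shrinks and most of the approximation work is done by choosing the regions themselves.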
But I don’t really think this is where the “magic” of deep learning comes from. In fact, this approximation property holds for all neural networks, including shallow ones, so it can’t capture what I see as the most important properties, like what makes deep networks generalize well OOD. For that you need to look elsewhere. It appears that deep neural networks have an inductive bias towards simple algorithms, i.e. those with low (pseudo-)Kolmogorov complexity. (IMO, from the spline perspective, a promising direction for explaining this could be via the compositionality and degeneracy of spline operators.)
Hope this helps!