Correct me if I’m wrong, but it struck me while reading this that you can think of a neural network as learning two things at once:
a classification of the input into 2^N different classes (where N is the total number of neurons), each of which gets a different function applied to it
those functions themselves
(also each function is a linear transformation)
The 2^N classes would be all the different polytopes, and each function would be the linear transformation that the network implements when a polytope’s neurons are on and all others are off.
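To make sure I’m picturing it right, here’s a rough numpy sketch of what I mean (the network, sizes, and numbers are made up purely for illustration): the on/off pattern of the neurons is the “class,” and it determines an effective affine map that the network applies everywhere inside that polytope.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer ReLU network: x -> W2 @ relu(W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def forward(x):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

def local_affine_map(x):
    """The 'class' (activation pattern) and the 'function' (affine map) at x."""
    pattern = (W1 @ x + b1 > 0).astype(float)  # which neurons are on: the polytope
    # Note W1 appears twice: in the gating above (which polytope we're in)
    # and in the effective map below (what gets computed there).
    A = W2 @ (pattern[:, None] * W1)           # effective weight matrix
    c = W2 @ (pattern * b1) + b2               # effective bias
    return pattern, A, c

x = rng.normal(size=3)
pattern, A, c = local_affine_map(x)

# Inside this polytope the whole network *is* the map x -> A @ x + c:
assert np.allclose(forward(x), A @ x + c)
print("class (activation pattern):", pattern)
print("function (effective matrix):\n", A)
```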
To me this suggests some questions:
Can the functions and classes be decoupled?
It seems a little odd that the weights in a given layer are kind of doing double-duty — they’re both helping to implement some particular linear transformation that the whole network performs (transforming the kind of data you feed into the network at the input into the kind of data you expect at the output), and they’re also helping to determine which linear function the rest of the network is going to perform (by tipping downstream neurons’ activations over their thresholds or not).
Is that at all a sensible way to look at it?
Could you come up with some other scheme for choosing between a whole bunch of different linear transformations?
How much of the power of neural networks comes from their ability to learn to classify something into exponentially many different classes vs from the linear transformations that each class implements?
Does this question even make sense?
The 2^N different linear transformations can’t be arbitrary. As mentioned in the post, there’s the constraint that neighboring polytopes implement very similar transformations, because their weight matrices vary by just one row.
What would be a reasonable way to measure this degree of constrained-ness?
This comment is the result of a layman thinking out loud. I’d appreciate anyone who can answer my questions or point out any misunderstandings!
Dropping some late answers here, though this isn’t my subfield, so forgive me if I get something wrong.
Correct me if I’m wrong, but it struck me while reading this that you can think of a neural network as learning two things at once:
a classification of the input into 2^N different classes (where N is the total number of neurons), each of which gets a different function applied to it
those functions themselves
This is exactly what a spline is! This is where the spline view of neural networks comes from (mentioned in Appendix C of the post). What you call “classes” the literature typically calls the “partition.” Also, while deep networks can theoretically have exponentially many elements in the partition (w.r.t. the number of neurons), in practice the number of regions tends to grow roughly linearly with the number of neurons instead.
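If anyone wants to see the gap concretely, a quick and dirty way is to count the distinct activation patterns a toy network actually produces on random inputs; the network below is arbitrary, just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-hidden-layer ReLU net on 2-D inputs; sizes are arbitrary.
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)

def activation_pattern(x):
    h1 = np.maximum(W1 @ x + b1, 0)
    h2 = np.maximum(W2 @ h1 + b2, 0)
    return tuple((h1 > 0).astype(int)) + tuple((h2 > 0).astype(int))

# Count the distinct patterns (partition elements) hit by random inputs.
patterns = {activation_pattern(x) for x in rng.uniform(-3, 3, size=(100_000, 2))}
print(f"{len(patterns)} distinct patterns observed, "
      f"out of 2^32 = {2**32} conceivable ones")
```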
Can the functions and classes be decoupled?
To my understanding this is exactly what previous (non-ML) research on splines did, with things like free-knot splines. Unfortunately this is computationally intractable. So instead much research focused on fixing the partition (say, to a uniform grid), and changing only the functions. A well-known example here is the wavelet transform. But then you lose the flexibility to change the partition—incredibly important if some regions need higher resolution than others!
From this perspective the coupling of functions to the partition is exactly what makes neural networks good approximators in the first place! It allows you to freely move the partition, like with free-knot splines, but in a way that’s still computationally tractable. Intuitively, neural networks have the ability to use high resolution where it’s needed most, like how 3D meshes of video game characters have the most polygons in their face.
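For concreteness, here is a minimal sketch of the “fix the partition, learn only the functions” approach mentioned above: a least-squares fit on a fixed uniform grid of piecewise-linear hat functions. The data and knot count are made up; the point is just that the knots never move, no matter where the data actually needs resolution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a 1-D function with a sharp feature near x = 0.2.
x = np.linspace(0, 1, 500)
y = np.tanh(40 * (x - 0.2)) + 0.1 * rng.normal(size=x.size)

# Fixed partition: uniform knots, chosen before seeing the data.
knots = np.linspace(0, 1, 10)

def hat_basis(x, knots):
    """Piecewise-linear 'hat' basis functions on a fixed, uniform knot grid."""
    h = knots[1] - knots[0]
    return np.maximum(0, 1 - np.abs(x[:, None] - knots[None, :]) / h)

# Only the coefficients (the 'functions') are learned; the partition is frozen.
B = hat_basis(x, knots)
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
fit = B @ coef

print("RMS error with a uniform, data-independent partition:",
      np.sqrt(np.mean((fit - y) ** 2)))
# A free-knot fit could crowd knots near x = 0.2 and do much better,
# but optimizing the knot positions is the hard (non-convex) part.
```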
How much of the power of neural networks comes from their ability to learn to classify something into exponentially many different classes vs from the linear transformations that each class implements?
There are varying answers here, depending on what you mean by “power”: I’d say either the first or neither. If you mean “the ability to approximate efficiently,” then I would probably say that the partition matters more; assuming the partition is sufficiently fine, each linear transformation only performs a “first-order correction” to the mean value on its region of the partition.
But I don’t really think this is where the “magic” of deep learning comes from. In fact this approximation property holds for all neural networks, including shallow ones. It can’t capture what I see as the most important properties, like what makes deep networks generalize well OOD. For that you need to look elsewhere. It appears that deep neural networks have an inductive bias towards simple algorithms, i.e. those with a low (pseudo) Kolmogorov complexity. (IMO, from the spline perspective, a promising direction to explain this could be via compositionality and degeneracy of spline operators.)
Hope this helps!
Thanks for your interest in our post and your questions!
Correct me if I’m wrong, but it struck me while reading this that you can think of a neural network as learning two things at once…
That seems right!
Can the functions and classes be decoupled? … Could you come up with some other scheme for choosing between a whole bunch of different linear transformations?
It seems possible to come up with other schemes that do this; it just doesn’t seem easy to come up with something that is competitive with neural nets. If I recall correctly, there’s work in previous decades (which I’m struggling to find right now, although it’s easy to find similar more modern work e.g. https://pubmed.ncbi.nlm.nih.gov/23272922/ ) that builds a nonlinear dynamical system using N linear regions centred on N points. This work models a dynamical system, but there’s no reason we can’t just use the same principles for purely feedforward networks. The dynamics of the system are defined by whichever point the current state is closest to. The linear regions can have whatever dynamics you want. But then you’d have to store and look up N matrices, which isn’t great when N is large!
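A minimal sketch of that kind of scheme (with made-up centres and matrices) might look like the following: the “classification” step is a nearest-centre lookup, and the “function” step is whichever matrix is stored for that centre. The cost of storing and looking up one matrix per region is exactly the drawback mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

d, N = 3, 50                       # input dimension, number of linear regions
centres = rng.normal(size=(N, d))  # one centre per region
A = rng.normal(size=(N, d, d))     # one linear map per region (N matrices to store!)
b = rng.normal(size=(N, d))

def piecewise_linear(x):
    # "Classification": pick the region whose centre is nearest to x.
    k = np.argmin(np.linalg.norm(centres - x, axis=1))
    # "Function": apply that region's own linear map.
    return A[k] @ x + b[k]

print(piecewise_linear(rng.normal(size=d)))
```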
How much of the power of neural networks comes from their ability to learn to classify something into exponentially many different classes vs from the linear transformations that each class implements? Does this question even make sense?
I guess this depends on what you mean by ‘power’.
The 2^N different linear transformations can’t be arbitrary. As mentioned in the post, there’s the constraint that neighboring polytopes implement very similar transformations, because their weight matrices vary by just one row. What would be a reasonable way to measure this degree of constrained-ness?
I’m not sure! On the one hand, we could measure the dissimilarity of the transformations as the Frobenius norm (i.e. distance in matrix-space) of the difference matrix between linearized transformations on both sides of a polytope boundary. On the other hand, this difference can be arbitrarily large if the weights of our model are unbounded, because crossing some polytope boundaries might mean that a neuron with arbitrarily large weights turns on or off.
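As a rough sketch of what that measurement could look like, for a toy one-hidden-layer ReLU net with made-up weights: when a single hidden neuron flips across the boundary, the two linearized maps differ by a rank-one matrix, so the Frobenius norm of the difference is just the product of the norms of that neuron’s incoming and outgoing weight vectors, which is indeed unbounded if the weights are.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer ReLU net: x -> W2 @ relu(W1 @ x + b1)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2 = rng.normal(size=(3, 8))

def effective_matrix(pattern):
    """Linearized map for a given on/off pattern of the hidden neurons."""
    return W2 @ (pattern[:, None] * W1)

pattern = (rng.normal(size=4) @ W1.T + b1 > 0).astype(float)
flipped = pattern.copy()
flipped[0] = 1.0 - flipped[0]      # cross the boundary where neuron 0 flips

diff = effective_matrix(pattern) - effective_matrix(flipped)
frob = np.linalg.norm(diff, "fro")

# The difference is rank one: +/- outer(W2[:, 0], W1[0, :]), so its Frobenius
# norm equals the product of the incoming and outgoing weight norms.
print(frob, np.linalg.norm(W2[:, 0]) * np.linalg.norm(W1[0, :]))  # equal
```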
Thanks for the answers!