This is probably true for neural networks in particular, but mathematically speaking, it completely depends on how you parameterise the functions. You can create a parameterisation in which this is not true.
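To illustrate with a standard one-dimensional toy example (not one from the post): take the function class {x ↦ c·x : c ∈ ℝ} and parameterise it either as f_w(x) = w·x or as f_u(x) = u³·x. If the true function is f ≡ 0 and we use squared error with unit-variance inputs, the population losses are K(w) = E[(w·x)²] = w², whose RLCT is 1/2, and K(u) = E[(u³·x)²] = u⁶, whose RLCT is 1/6. Same function class, different geometry at the minimum, so geometric quantities like the RLCT depend on the parameterisation and not just on the functions being parameterised.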
Agreed. So maybe what I’m actually trying to get at is a statement about what “universality” means in the context of neural networks. Just as the microscopic details of physical theories don’t matter much to their macroscopic properties in the vicinity of critical points (“universality” in statistical physics), and just as the microscopic details of random matrices don’t seem to matter for their bulk and edge statistics (“universality” in random matrix theory), many of the particular choices of neural network architecture don’t seem to matter for learned representations (“universality” in DL).
What physics and random matrix theory tell us is that a given system’s universality class is determined by its symmetries. (This starts to get at why we SLT enthusiasts are so obsessed with neural network symmetries.) In the case of learning machines, those symmetries are fixed by the parameter-function map, so I totally agree that you need to understand the parameter-function map.
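To make “symmetries” concrete, here is a minimal sketch, using a hypothetical one-hidden-unit ReLU network purely for illustration, of the kind of symmetry the parameter-function map builds in:

```python
import numpy as np

# Rescaling symmetry of a one-hidden-unit ReLU network f(x) = v * relu(w * x):
# for any a > 0, (w, v) -> (a * w, v / a) leaves the computed function unchanged,
# since relu(a * w * x) = a * relu(w * x). The symmetry is a property of the
# parameter-function map, and it shows up as a flat direction in the loss landscape.

def relu_net(w, v, x):
    return v * np.maximum(w * x, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
w, v, a = 1.3, -0.7, 5.0

assert np.allclose(relu_net(w, v, x), relu_net(a * w, v / a, x))
```

Permutations of hidden units are the other standard example; both are fixed by the architecture alone, and the continuous rescaling symmetry in particular traces out exactly flat directions in the loss landscape.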
However, focusing on symmetries is already a pretty major restriction. If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.
There’s another important observation, which is that neural network symmetries leave geometric traces. Even if the RLCT on its own does not “solve” generalization, the SLT-inspired geometric perspective might still hold the answer: it should be possible to distinguish neural networks from the polynomial example you provided by understanding the geometry of the loss landscape. The ambitious statement here might be that all the relevant information you might care about (in terms of understanding universality) is already contained in the loss landscape.
If that’s the case, my concern about focusing on the parameter-function map is that it would be a distraction: you could miss the forest for the trees if you’re trying to understand the structure that develops and phenomena like generalization. I expect the more fruitful perspective to remain anchored in geometry.
Is this not satisfied trivially due to the fact that the RLCT has a certain maximum and minimum value within each model class? (If we stick to the assumption that Θ is compact, etc.)
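(Concretely, the bounds I have in mind: under the usual SLT assumptions, compact Θ, analytic model, realizable truth, the RLCT λ of a d-parameter model satisfies 0 < λ ≤ d/2, with λ = d/2 in the regular case.)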
If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.
I suppose this depends on what you mean by “most”. DNNs and CNNs have noticeable and meaningful differences in their (macroscopic) generalisation behaviour, and these differences are due to differences in their parameter-function maps. This is also true of LSTMs vs transformers, and so on. I think it’s fairly likely that these kinds of differences could have a large impact on the probability that a given type of model will learn to exhibit goal-directed behaviour in a given training setup, for example.
The ambitious statement here might be that all the relevant information you might care about (in terms of understanding universality) is already contained in the loss landscape.
Do you mean the loss landscape in the limit of infinite data, or the loss landscape for a “small” amount of data? In the former case, the loss landscape determines the parameter-function map over the data distribution. In the latter case, my guess would be that the statement probably is false (though I’m not sure).
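(For concreteness, taking squared error just as an example: in the infinite-data limit the relevant landscape is the population loss K(w) = E_{x∼D}[(f_w(x) − f*(x))²], so the landscape and the parameter-function map are linked through the data distribution D, and the zero set of K consists exactly of the parameters whose functions agree with the truth on the support of D.)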
Hmm, maybe restrict f so it has to range over R.