If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.
I suppose this depends on what you mean by “most”. DNNs and CNNs have noticeable and meaningful differences in their (macroscopic) generalisation behaviour, and these differences are due to their parameter-function map. This is also true of LSTMs vs transformers, and so on. I think it’s fairly likely that these kinds of differences could have a large impact on the probability that a given type model will learn to exhibit goal-directed behaviour in a given training setup, for example.
The ambitious statement here might be that all the relevant information you might care about (in terms of understanding universality) are already contained in the loss landscape.
Do you mean the loss landscape in the limit of infinite data, or the loss landscape for a “small” amount of data? In the former case, the loss landscape determines the parameter-function map over the data distribution. In the latter case, my guess would be that the statement probably is false (though I’m not sure).
I suppose this depends on what you mean by “most”. DNNs and CNNs have noticeable and meaningful differences in their (macroscopic) generalisation behaviour, and these differences are due to their parameter-function map. This is also true of LSTMs vs transformers, and so on. I think it’s fairly likely that these kinds of differences could have a large impact on the probability that a given type model will learn to exhibit goal-directed behaviour in a given training setup, for example.
Do you mean the loss landscape in the limit of infinite data, or the loss landscape for a “small” amount of data? In the former case, the loss landscape determines the parameter-function map over the data distribution. In the latter case, my guess would be that the statement probably is false (though I’m not sure).