I gather node permutation is only one of the symmetries involved, which include both discrete symmetries like permutations and continuous ones such as symmetries involving shifting sets of parameters in ways that produce equivalent network outputs.
As I understand it (and I’m still studying this stuff), the prediction from Singular Learning Theory is that there are large sets of local minima, each set internally isomorphic to each other so having the same loss function value (modulo rounding errors or not having quite settled to the optimum). But the prediction of SLT is that there are generically multiple of these sets, whose loss functions are not the same. The ones whose network representation is simplest (i.e. with lowest Kolmogorov complexity when expressed in this NN architecture) will have the largest isometry symmetry group, so are the easiest to find/are most numerous and densely packed in the space of all NN configurations. So we get Occam’s Razor for free. However, typically the best ones with the best loss values will be larger/more complex, so harder to locate. That is about my current level of understanding of SLT, but I gather that with a large enough training set and suitable SGD learning metaparameter annealing one can avoid settling in a less good lower-complexity minimum and attempt to find a better one, thus improving your loss function result, and there is some theoretical mathematical understanding of how well one an expect to do based on training set size.
I gather node permutation is only one of the symmetries involved, which include both discrete symmetries like permutations and continuous ones such as symmetries involving shifting sets of parameters in ways that produce equivalent network outputs.
As I understand it (and I’m still studying this stuff), the prediction from Singular Learning Theory is that there are large sets of local minima, each set internally isomorphic to each other so having the same loss function value (modulo rounding errors or not having quite settled to the optimum). But the prediction of SLT is that there are generically multiple of these sets, whose loss functions are not the same. The ones whose network representation is simplest (i.e. with lowest Kolmogorov complexity when expressed in this NN architecture) will have the largest isometry symmetry group, so are the easiest to find/are most numerous and densely packed in the space of all NN configurations. So we get Occam’s Razor for free. However, typically the best ones with the best loss values will be larger/more complex, so harder to locate. That is about my current level of understanding of SLT, but I gather that with a large enough training set and suitable SGD learning metaparameter annealing one can avoid settling in a less good lower-complexity minimum and attempt to find a better one, thus improving your loss function result, and there is some theoretical mathematical understanding of how well one an expect to do based on training set size.