Let me see if I understand your question correctly. Are you asking: does the effective dimensionality / complexity / RLCT (λ) actually tell us something different from the number of non-zero weights? And if the optimization method we’re currently using already finds low-complexity solutions, why do we need to worry about it anyway?
So the RLCT measures the “effective dimensionality” at the most singular point of the loss landscape. This is different from the number of non-zero weights because there are other symmetries the network can take advantage of (for instance, directions in parameter space along which the loss is exactly flat). The claim is currently more descriptive than prescriptive. It says that if you are doing Bayesian inference, then, in the limit of large datasets, the RLCT (which is a local quantity) ends up having a global effect on your expected behavior. This is true even if your model never sits exactly at the singularity.
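To make the “effective dimensionality vs. non-zero weights” distinction concrete, here is a minimal numerical sketch (my own toy construction, not from any particular paper). Both losses below live on the same 2-parameter space, and generically both parameters are non-zero. But the singular loss K(w) = (w1·w2)² is flat along the hyperbolas w1·w2 = const, so the volume of near-optimal parameters V(ε) = Vol{w : K(w) < ε} scales like ε^λ with λ ≈ 1/2 rather than the regular λ = d/2 = 1. The RLCT is exactly this volume-scaling exponent.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500_000
w = rng.uniform(-1, 1, size=(N, 2))  # uniform prior on [-1, 1]^2

# Two toy losses on the same 2-parameter space:
#   regular:  K(w) = w1^2 + w2^2   -> lambda = d/2 = 1
#   singular: K(w) = (w1 * w2)^2   -> lambda = 1/2 (up to a log factor),
# because the loss is flat along the curves w1 * w2 = const.
K_regular = w[:, 0] ** 2 + w[:, 1] ** 2
K_singular = (w[:, 0] * w[:, 1]) ** 2

def volume_exponent(K, eps_grid):
    """Fit the slope of log V(eps) vs log eps, where V(eps) is the
    fraction of parameter space with loss below eps; V(eps) ~ eps^lambda."""
    fracs = np.array([np.mean(K < e) for e in eps_grid])
    slope, _ = np.polyfit(np.log(eps_grid), np.log(fracs), 1)
    return slope

eps_grid = np.geomspace(3e-3, 3e-1, 10)
lam_reg = volume_exponent(K_regular, eps_grid)
lam_sing = volume_exponent(K_singular, eps_grid)
print(f"regular exponent:  {lam_reg:.2f}   (theory: 1)")
print(f"singular exponent: {lam_sing:.2f}  (theory: 1/2, biased low at finite eps by a log correction)")
```

The singular model’s fitted exponent comes out well below the regular one, even though neither weight is zero: the “compression” comes from the continuous symmetry, not from sparsity. That smaller λ is exactly what enters the asymptotic Bayesian free energy, F(n) ≈ nL₀ + λ log n, which is how a local property of the worst singularity gets its global effect.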
So this isn’t currently proposing a new kind of optimization technique. Rather, it’s making a claim about which features of the loss landscape have the most influence on the training dynamics you observe. This is exact for Bayesian inference but still conjectural for real NNs (though there is early supporting evidence from experiments).