I was trying to make a more specific point. Let me know if you think the distinction is meaningful -
So there are lots of “semi-rigorous” successes in Deep Learning. One I understand better than muP is good old Xavier initialization. Assuming the activations at a given layer are roughly normally distributed, we should scale our weights like 1/sqrt(n) so the activations don’t diverge from layer to layer (since a sum of n independent normals has standard deviation scaling like sqrt(n)). This is exactly true at initialization, and hence for the first gradient step, but it can become false at any later step once the weights are no longer independent and can “conspire” to blow up the activations anyway. So not a “proof”, but successful in practice.
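(As an illustration only, not something from the original argument: here’s a minimal NumPy sketch of the forward pass at initialization, with an arbitrary width n and depth and no nonlinearity, showing that 1/sqrt(n)-scaled weights keep activation magnitudes roughly constant while unscaled weights blow them up.)

```python
import numpy as np

# Minimal sketch of the 1/sqrt(n) argument at initialization (no training,
# no nonlinearity): each pre-activation is a sum of n independent terms,
# so its standard deviation grows by ~sqrt(n) per layer unless we rescale.
rng = np.random.default_rng(0)
n, depth = 512, 20                 # illustrative width and depth
h_scaled = rng.standard_normal(n)  # input activations, std ~ 1
h_unscaled = h_scaled.copy()

for _ in range(depth):
    W = rng.standard_normal((n, n))          # independent N(0, 1) weights
    h_scaled = (W / np.sqrt(n)) @ h_scaled   # 1/sqrt(n) scaling: std stays O(1)
    h_unscaled = W @ h_unscaled              # no scaling: std grows ~sqrt(n) per layer

print("std with 1/sqrt(n) scaling:", h_scaled.std())
print("std without scaling:       ", h_unscaled.std())
```

Once training starts and the weights become correlated with the data and with each other, nothing in this sketch guarantees the scaled version stays well behaved; that’s the gap between the heuristic and a proof.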
My understanding of muP is similar, in that it is “precise” only if certain correlations are well controlled (I’m fuzzy here). It is still very successful in practice, but the derivation isn’t airtight, so we still need that last step of checking “in practice”.
This is very different from the situation within statistical learning theory itself, which has many beautiful and unconditional results that we would very much like to port over to deep learning. My central point is that, in the absence of a formal correspondence, we have to bridge the gap with evidence. That’s why the last part of my comment was some evidence that I think speaks against the intuitions of statistical learning theory. I consider Xavier initialization, muP, scaling laws, etc. to be examples where this bridge was successfully crossed, but crossing it was still necessary! And so we’re reduced to “arguing over evidence between paradigms” when we’d prefer to “prove results within a paradigm”.