This CLT mixing effect might be expected to destroy information in the representations, as happens in the NTK limit of infinite width, where the CLT becomes infinitely strong and no information can be propagated between layers. It is not clear how the network preserves specific, detailed information in its activations despite near-Gaussian mixing.
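To make the mixing effect concrete, here is a toy numpy sketch (my own illustration, not a calculation from any of the sources discussed): a single preactivation z = sum_j W_j x_j of a wide random layer looks increasingly Gaussian as the width grows, even when the incoming activations are strongly non-Gaussian, so the detailed shape of the input distribution gets washed out.

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis_of_preactivation(width, n_samples=50_000):
    """Measure how non-Gaussian a single preactivation z = sum_j W_j x_j is.

    The inputs x are deliberately very non-Gaussian (exponential), and the
    weight variance is 1/width so the scale of z stays fixed as width grows.
    Excess kurtosis is 0 for an exact Gaussian.
    """
    x = rng.exponential(size=(n_samples, width))            # non-Gaussian inputs
    W = rng.normal(0.0, np.sqrt(1.0 / width), size=width)   # one unit's random weights
    z = x @ W                                                # sum of many weak contributions
    z = (z - z.mean()) / z.std()
    return float(np.mean(z**4) - 3.0)

for width in [4, 16, 64, 256]:
    print(f"width = {width:4d}   excess kurtosis ~ {excess_kurtosis_of_preactivation(width):+.3f}")
```

The non-Gaussianity shrinks toward zero roughly like 1/width: a very wide layer “forgets” the fine-grained shape of its inputs, which is the tension the paragraph above is pointing at.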
Have you looked at Roberts and Yaida’s Principles of Deep Learning Theory?
They develop a first-order perturbative correction to NTK, where the perturbative parameter is the depth-to-width ratio of the network. The resulting distributions are “nearly Gaussian,” with a non-Gaussian correction controlled by that ratio.
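To spell out what “nearly Gaussian” means here (this is my schematic gloss rather than a formula copied from the book; z denotes a layer’s preactivations, K their Gaussian covariance/kernel, n the width, L the depth):

$$p(z) \;\propto\; e^{-S(z)}, \qquad S(z) \;=\; \frac{1}{2K}\sum_i z_i^2 \;+\; O\!\left(\tfrac{L}{n}\right)\cdot(\text{quartic terms in } z),$$

so connected correlators beyond the second (e.g. the connected four-point function of preactivations) are suppressed by powers of L/n and vanish in the strict infinite-width limit.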
Roughly, the authors claim that this regime—where the O(depth/width) correction to NTK is important but higher-order corrections can be neglected—is not only tractable, but also where real NNs operate. They make a number of claims about why you’d want the depth-to-width ratio to be small but nonzero, such as:

- If the ratio is zero, there’s no feature learning (NTK). But feature learning does occur in the first-order (small but nonzero) theory, so maybe that’s “enough.”
- As the ratio grows larger, vanishing/exploding activations and gradients become more and more likely when considered across different initialization draws, test inputs, etc., even if you pick an initialization scheme that is well behaved on average (see the sketch after this list).
- They make an argument connecting this ratio to the bias-variance tradeoff, where overly deep/narrow networks become overly high-variance. (IIUC this extends the “across initialization draws, test inputs, etc.” of the previous point to “...across draws of the training data.”)
- They also have another argument involving mutual information … suffice it to say they have a lot of these arguments :)
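Re the second bullet, here is a toy experiment (mine, not one from the book) showing what “well behaved on average but increasingly variable across draws” can look like: a ReLU MLP with He/“critical” initialization preserves the average squared activation norm from layer to layer, yet the spread of that norm across initialization seeds grows with the depth-to-width ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_norm(depth, width, x):
    """Push a fixed input through a freshly initialized ReLU MLP.

    He initialization (weight variance 2/width) is tuned so that the
    expected squared activation norm is preserved from layer to layer.
    """
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
        h = np.maximum(W @ h, 0.0)
    return np.linalg.norm(h)

width = 128
x = rng.normal(size=width)  # one fixed test input, reused across seeds
for depth in [4, 16, 64, 256]:
    norms = np.array([final_norm(depth, width, x) for _ in range(100)])
    # The *average* squared norm is preserved by construction, but the spread of
    # the norm across initialization draws grows with depth/width: individual
    # draws increasingly vanish or explode even though the mean is well behaved.
    print(f"depth/width = {depth/width:5.2f}   std of log-norm across seeds = {np.log(norms).std():.2f}")
```

The mean squared norm is fine at every depth; it is the seed-to-seed (and analogously input-to-input) fluctuations that grow with depth/width, which is the failure mode attributed above to large depth-to-width ratios.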
(I have only skimmed the book and can’t really claim to understand it, so I’m mostly bringing it up because it sounds like you’d find it relevant.)
Thanks for your summary of the book!
I think the post and analysis are some evidence that it might be tractable to apply tools from the book directly to transformer architectures and LLMs.