However, there are mostly no such constraints in ANN training (by default), so it doesn’t seem destined to me that LLM behaviour should “compress” very much.
The point of the Singular Learning Theory digression was to help make legible why I think this is importantly false. NN training has a strong simplicity bias, basically regardless of the optimizer used for training, and even in the absence of any explicit regularisation. This bias towards compression is a result of the particular degenerate structure of NN loss landscapes, which is in turn a result of the NN architectures themselves. Simpler solutions in these loss landscapes have a lower “learning coefficient,” which you can think of as an “effective” parameter count: they occupy more volume in the loss landscape (or higher-dimensional volume, in the idealized case) than more complicated solutions with higher learning coefficients.
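To make the volume intuition concrete, here is a toy Monte Carlo sketch (my own illustration, using the standard two-parameter toy losses from the SLT literature rather than anything from this thread): it estimates how much parameter-space volume sits below a loss threshold for a regular loss versus a degenerate one with a lower learning coefficient.

```python
# Toy Monte Carlo sketch (illustrative only): compare the parameter-space
# volume below loss epsilon for a "regular" 2-parameter loss
# L1(a, b) = a^2 + b^2 (learning coefficient 1) versus a degenerate loss
# L2(a, b) = (a * b)^2 (learning coefficient 1/2). A lower learning
# coefficient means the low-loss region shrinks more slowly as epsilon -> 0,
# i.e. the "simpler" solution occupies more volume.

import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000
a, b = rng.uniform(-1, 1, size=(2, n))  # uniform prior over [-1, 1]^2

regular_loss = a**2 + b**2
degenerate_loss = (a * b) ** 2

for eps in [1e-2, 1e-3, 1e-4]:
    frac_regular = np.mean(regular_loss < eps)
    frac_degenerate = np.mean(degenerate_loss < eps)
    print(f"eps={eps:.0e}  regular~{frac_regular:.2e}  degenerate~{frac_degenerate:.2e}")

# Expected behaviour: the regular fraction scales roughly like eps^1, the
# degenerate fraction roughly like eps^(1/2) (times a log factor), so the
# degenerate minimum dominates the low-loss volume as epsilon shrinks.
```

The exact thresholds and sample count are arbitrary; the point is only the scaling difference between the two losses.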
This bias in the loss landscapes isn’t quite about simplicity alone; it is perhaps better thought of as a particular mix of a simplicity prior and a peculiar kind of speed prior.
That is why Deep Learning works in the first place. That is why NN training can readily yield solutions that generalize far past the training data, even when you have substantially more parameters than data points to fit on. That is why, with a bit of fiddling around, training a transformer can get you a language model, whereas training a giant polynomial on predicting internet text will not get you a program that can talk. SGD or no SGD, momentum or no momentum, weight regularisation or no weight regularisation. Because polynomial loss landscapes do not look like NN loss landscapes.
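For a rough sense of the polynomial-vs-NN contrast, here is a hedged toy sketch (my own, not from the thread; the degree, layer sizes, and data are arbitrary choices): both models have more than enough capacity to fit 30 noisy points, but they tend to generalize very differently, with no explicit regularisation in either case.

```python
# Rough illustrative sketch: fit the same noisy 1-D data with (i) a heavily
# over-parameterised polynomial and (ii) a small MLP, then compare test error.
# The polynomial can interpolate the training points but typically oscillates
# between them; the MLP, despite having far more parameters, tends to land on
# a smooth, simple fit.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-3, 3, 30))
y_train = np.sin(x_train) + 0.1 * rng.normal(size=x_train.shape)
x_test = np.linspace(-3, 3, 300)
y_test = np.sin(x_test)

# (i) degree-25 polynomial: ~26 parameters for 30 points
# (np.polyfit may warn that the fit is poorly conditioned, which is
# itself part of the point)
poly_coeffs = np.polyfit(x_train, y_train, deg=25)
poly_pred = np.polyval(poly_coeffs, x_test)

# (ii) small MLP: thousands of parameters, no weight decay (alpha=0.0)
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), alpha=0.0,
                   max_iter=20000, random_state=0)
mlp.fit(x_train.reshape(-1, 1), y_train)
mlp_pred = mlp.predict(x_test.reshape(-1, 1))

print("poly test MSE:", np.mean((poly_pred - y_test) ** 2))
print("MLP  test MSE:", np.mean((mlp_pred - y_test) ** 2))
# Typically the polynomial's test error is far worse, even though both
# models have ample capacity to fit the training data.
```

This is obviously a caricature of “training a giant polynomial on internet text,” but it gestures at the same asymmetry: capacity alone doesn’t explain what gets learned; the shape of the loss landscape does.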
I agree with you, but it’s not clear that, in the absence of explicit regularisation, DNNs, and LLMs in particular, will compress to the degree that they become intelligible (interpretable) to humans. That is, their effective dimensionality might be reduced from 1T to 100M or whatever, but that would still be far too much for humans to comprehend. Explicit regularisation is what drives this effective dimensionality down further.