(responding to Jacob specifically here) A lot of things that were thought of as “obvious” were later found to be false in the context of deep learning, the bias-variance trade-off being one example.
I think what you’re saying makes sense at a high/rough level, but I’m also worried you are not being rigorous enough. It is true and well known that L2 regularization can be derived from Bayesian neural nets with a Gaussian prior on the weights. However, neural nets in deep learning are trained via SGD, not with Bayesian updating, and it doesn’t seem that modern CNNs actually approximate their Bayesian cousins all that well; otherwise I would expect them to be better calibrated. Still, I think overall what you’re saying makes sense.
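For concreteness, here is a minimal sketch of that derivation, assuming an isotropic Gaussian prior $p(w) = \mathcal{N}(0, \sigma^2 I)$ on the weights and writing the data as $D$:

$$\hat{w}_{\mathrm{MAP}} = \arg\max_w \big[\log p(D \mid w) + \log p(w)\big] = \arg\min_w \Big[-\log p(D \mid w) + \tfrac{1}{2\sigma^2}\lVert w \rVert_2^2\Big],$$

i.e. the ordinary training loss plus an L2 penalty with coefficient $1/(2\sigma^2)$.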
If we were going to really look at this rigorously, we’d also have to define what we mean by compressibility. One way might be via some type of lossy compression, using model pruning or some form of distillation. Have there been studies showing that models trained with Dropout can be pruned further or distilled more easily?
However neural nets in deep learning are trained via SGD, not with Bayesian updating
SGD is a form of efficient approximate Bayesian updating. More specifically it’s a local linear 1st order approximation. As the step size approaches zero this approximation becomes tight, under some potentially enormous simplifying assumptions of unit variance (which are in practice enforced through initialization and explicit normalization).
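To make that slightly more concrete (a sketch under exactly those simplifying assumptions, with the belief variance tied to the step size $\eta$): treat the current weights $w_t$ as a Gaussian belief $\mathcal{N}(w_t, \eta I)$ and take a single MAP step on the minibatch $D_t$,

$$w_{t+1} = \arg\max_w \Big[\log p(D_t \mid w) - \tfrac{1}{2\eta}\lVert w - w_t \rVert^2\Big].$$

Linearizing the log-likelihood around $w_t$ (the first-order part) gives

$$w_{t+1} = w_t + \eta\, \nabla_w \log p(D_t \mid w_t),$$

which is exactly an SGD step of size $\eta$, and the linearization error vanishes as $\eta \to 0$.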
But anyway that’s not directly relevant, as Bayesian updating doesn’t have some monopoly on entropy/complexity tradeoffs.
If you want to be ‘rigorous’, then you shouldn’t have confidently said:
Even if biasing towards simpler models is a good idea, we don’t have a good way of doing this in deep learning yet, apart from restricting the number of parameters,
(As you can’t rigorously back that statement up). Regularization to bias towards simpler models in DL absolutely works well, regardless of whether you understand it or find the provided explanations satisfactory.
SGD is a form of efficient approximate Bayesian updating.
Yeah I saw you were arguing that in one of your posts. I’ll take a closer look. I honestly have not heard of this before.
Regarding my statement: I agree, looking back at it, that it is horribly sloppy and sounds absurd. When I wrote it I was just thinking about how all L1 and L2 regularization do is bias towards smaller weights; the models still take up the same amount of space on disk and require the same amount of compute to run in terms of FLOPs. But yes, you’re right that they make the models easier to approximate.
So actually L1/L2 regularization does allow you to compress the model by reducing entropy, as evidenced by the fact that any effective pruning/quantization system necessarily involves some strong regularizer applied during training or after.
The model itself can’t possibly know or care whether you later actually compress said weights or not, so it’s never the actual compression itself that matters, but rather the inherent compressibility (which comes from the regularization).
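To illustrate the compressibility point, here is a toy sketch (the model, data, and pruning threshold are all made up; nothing here is from an actual study). The stored model is identical in size and FLOPs with or without the regularizer, but with an L1 penalty most weights end up below a pruning threshold:

```python
# Toy illustration (made-up model, data, and threshold): an L1 penalty during
# training makes most weights small enough to prune, even though the stored
# model is the same size and costs the same FLOPs either way.
import torch

torch.manual_seed(0)
X = torch.randn(256, 64)
true_w = torch.zeros(64)
true_w[:8] = 1.0                                  # only 8 of 64 features matter
y = X @ true_w + 0.5 * torch.randn(256)

def train(l1_coeff):
    model = torch.nn.Linear(64, 1)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    for _ in range(2000):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(X).squeeze(-1), y)
        loss = loss + l1_coeff * model.weight.abs().sum()   # L1 regularizer
        loss.backward()
        opt.step()
    return model

for l1 in (0.0, 0.1):
    w = train(l1).weight.detach()
    prunable = (w.abs() < 1e-2).float().mean().item()
    print(f"L1 coeff {l1}: fraction of weights below pruning threshold = {prunable:.2f}")
```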