To my knowledge the most used regularization method in deep learning, dropout, doesn’t make models simpler in the sense of being more compressible.
A simple L1 regularization would make models more compressible in so far as it suppresses weights towards zero so they can just be thrown out completely without affecting model performance much. I’m not sure about L2 regularization making things more compressible—does it lead to flatter minima for instance? (GPT-3 uses L2 regularization, which they call “weight decay”).
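To make that L1 intuition concrete, here is a minimal sketch (assuming PyTorch; the helper names are just illustrative) of adding an L1 penalty to the training loss and then pruning the near-zero weights it produces:

```python
import torch

def l1_penalty(model, lam=1e-4):
    # Sum of absolute weight values; added to the task loss, this pushes
    # many weights towards (near) zero during training.
    return lam * sum(p.abs().sum() for p in model.parameters())

def prune_small_weights(model, threshold=1e-3):
    # The crude sense of "compressible": weights the penalty has driven
    # below the threshold can be zeroed out with little effect on accuracy.
    with torch.no_grad():
        for p in model.parameters():
            p.mul_((p.abs() > threshold).float())
```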
But yes, you are right, Occam factors are intrinsic to the process of Bayesian model comparison; however, that's in the context of fully probabilistic models, not a comparison of deterministic models (i.e. Turing programs), which is what is done in Solomonoff induction. In Solomonoff induction they have to tack Occam's razor on top.
I didn’t state my issues with Solomonoff induction very well, that is something I hope to summarize in a future post.
Overall I think it’s not clear that Solomonoff induction actually works very well once you restrict it to a finite prior. If the true program isn’t in the prior, for instance, there is no guarantee of convergence—it may just oscillate around forever (the “grain of truth” problem).
There are other problems too (see a list here, the "Background" part of this post by Vanessa Kosoy, as well as Hutter's own open problems).
One of Kosoy's points, I think, is something like this: if an AIXI-like agent has two models that are very similar, but one has a weird extra "if then" statement tacked on to help it understand something (say, that at night the world stops existing and the laws of physics no longer apply, when in actuality the lights in the room just go off), then it may take the agent an extremely long time to converge on the correct model, because the difference in complexity between the two models is very small.
To my knowledge the most used regularization method in deep learning, dropout, doesn’t make models simpler in the sense of being more compressible.
Yes, it does (as should make sense, because if you can drop out a parameter entirely, you don’t need it, and if it succeeds in fostering modularity or generalization, that should make it much easier to prune), and this was one of the justifications for dropout, and that has nice Bayesian interpretations too. (I have a few relevant cites in my sparsity tag.)
The idea that using dropout makes models simpler is not intuitive to me because, according to Hinton, dropout essentially does the same thing as ensembling. If what you end up with is something equivalent to an ensemble of smaller networks, then it's not clear to me that it would be easier to prune.
One of the papers you linked to appears to study dropout in the context of Bayesian modeling, and argues that it encourages sparsity. I'm willing to buy that dropout does in fact reduce complexity (i.e., make models more compressible), but I'm also not sure any of this is 100% clear cut.
It’s not that dropout provides some ensembling secret sauce; instead neural nets are inherently ensembles proportional to their level of overcompleteness. Dropout (like other regularizers) helps ensure they are ensembles of low complexity sub-models, rather than ensembles of over-fit higher complexity sub-models (see also: lottery tickets, pruning, grokking, double descent).
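For concreteness, here is the standard inverted-dropout forward pass as a minimal sketch (assuming PyTorch tensors): each training step samples a random sub-network, and at test time the full network is used, which roughly averages over those sub-models.

```python
import torch

def dropout_forward(x, p=0.5, training=True):
    # Training: sample a random binary mask, i.e. a random sub-network,
    # and rescale so activations keep the same expected value.
    # Test: use the full network, which approximates averaging over the
    # exponentially many sub-networks sampled during training.
    if not training:
        return x
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1 - p)
```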
By the way, if you look at Filan et al.’s paper “Clusterability in Neural Networks” there is a lot of variance in their results but generally speaking they find that L1 regularization leads to slightly more clusterability than L2 or dropout.
To my knowledge the most used regularization method in deep learning, dropout, doesn’t make models simpler in the sense of being more compressible.
L2 regularization is much more common than dropout, but both are a complexity prior and thus compress. This is true in a very obvious way for L2. Dropout is more complex to analyze, but has now been extensively analyzed and functions as a complexity/entropy penalty as all regularization does.
I’m not sure about L2 regularization making things more compressible—does it lead to flatter minima for instance?
L2 regularization (weight decay) obviously makes things more compressible—it penalizes models with high entropy under a per-param gaussian prior. “Flatter minima” isn’t a very useful paradigm for understanding this, vs Bayesian statistics.
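For reference, the standard MAP derivation behind that Gaussian-prior reading, sketched:

$$\hat{w}_{\mathrm{MAP}} = \arg\max_w \left[\log p(D \mid w) + \log p(w)\right], \qquad p(w) = \prod_i \mathcal{N}(w_i \mid 0, \sigma^2)$$

$$\Rightarrow\; \hat{w}_{\mathrm{MAP}} = \arg\min_w \left[-\log p(D \mid w) + \frac{1}{2\sigma^2}\lVert w\rVert_2^2\right]$$

So an L2 penalty $\lambda \lVert w \rVert_2^2$ corresponds to a zero-mean Gaussian prior with $\sigma^2 = 1/(2\lambda)$; a larger $\lambda$ is a tighter, lower-entropy prior on each weight.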
(responding to Jacob specifically here) A lot of things that were thought of as “obvious” were later found out to be false in the context of deep learning—for instance the bias-variance trade-off.
I think what you're saying makes sense at a high/rough level, but I'm also worried you are not being rigorous enough. It is true and well known that L2 regularization can be derived from Bayesian neural nets with a Gaussian prior on the weights. However, neural nets in deep learning are trained via SGD, not with Bayesian updating, and it doesn't seem that modern CNNs approximate their Bayesian cousins very well; otherwise, I would think, they would be better calibrated. Still, overall I think what you're saying makes sense.
If we were going to look at this really rigorously, we'd also have to define what we mean by compressibility. One way might be via some type of lossy compression, using model pruning or some form of distillation. Have there been studies showing that models trained with dropout can be pruned down more or distilled more easily?
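One crude way to operationalize compressibility, as a sketch (with `evaluate` and `model` as hypothetical stand-ins for one's own eval loop and network): measure the largest fraction of weights you can magnitude-prune before validation accuracy drops by more than some tolerance, and compare that across models trained with and without dropout.

```python
import copy
import torch

def prunable_fraction(model, evaluate, eps=0.01, steps=20):
    # Largest fraction of smallest-magnitude weights that can be zeroed
    # while keeping accuracy within eps of the unpruned model.
    base_acc = evaluate(model)
    all_w = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    for k in range(steps, 0, -1):
        frac = k / steps
        thresh = torch.quantile(all_w, frac)
        pruned = copy.deepcopy(model)
        with torch.no_grad():
            for p in pruned.parameters():
                p.mul_((p.abs() > thresh).float())
        if evaluate(pruned) >= base_acc - eps:
            return frac
    return 0.0
```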
However neural nets in deep learning are trained via SGD, not with Bayesian updating
SGD is a form of efficient approximate Bayesian updating. More specifically it’s a local linear 1st order approximation. As the step size approaches zero this approximation becomes tight, under some potentially enormous simplifying assumptions of unit variance (which are in practice enforced through initialization and explicit normalization).
But anyway that’s not directly relevant, as Bayesian updating doesn’t have some monopoly on entropy/complexity tradeoffs.
If you want to be ‘rigorous’, then you shouldn’t have confidently said:
Even if biasing towards simpler models is a good idea, we don’t have a good way of doing this in deep learning yet, apart from restricting the number of parameters,
(As you can’t rigorously back that statement up). Regularization to bias towards simpler models in DL absolutely works well, regardless of whether you understand it or find the provided explanations satisfactory.
SGD is a form of efficient approximate Bayesian updating.
Yeah I saw you were arguing that in one of your posts. I’ll take a closer look. I honestly have not heard of this before.
Regarding my statement: I agree, looking back at it, that it is horribly sloppy and sounds absurd. When I wrote it I was just thinking about how all L1 and L2 regularization do is bias towards smaller weights; the models still take up the same amount of space on disk and require the same amount of compute to run in terms of FLOPs. But yes, you're right that they make the models easier to approximate.
So actually L1/L2 regularization does allow you to compress the model by reducing entropy, as evidenced by the fact that any effective pruning/quantization system necessarily involves some strong regularizer applied during training or after.
The model itself can’t possibly know or care whether you later actually compress said weights or not, so it’s never the actual compression itself that matters, vs the inherent compressibility (which comes from the regularization).
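As a sketch of the quantization side of that: with uniform symmetric quantization, the step size is set by the largest-magnitude weight, so regularization that suppresses outlier weights directly shrinks the rounding error at a fixed bit budget.

```python
import torch

def quantize_uniform(w, num_bits=8):
    # Symmetric uniform quantization to 2**num_bits levels.
    # The step size scales with the largest-magnitude weight, so a tighter
    # weight distribution (e.g. after strong weight decay) quantizes with
    # smaller rounding error for the same bit budget.
    scale = w.abs().max().clamp_min(1e-12) / (2 ** (num_bits - 1) - 1)
    return torch.round(w / scale) * scale
```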