The idea that using dropout makes models simpler is not intuitive to me because, according to Hinton, dropout essentially does the same thing as ensembling. If what you end up with is something equivalent to an ensemble of smaller networks, then it’s not clear to me that it would be easier to prune.
One of the papers you linked to appears to study dropout in the context of Bayesian modeling, and the authors argue that it encourages sparsity. I’m willing to buy that it does in fact reduce complexity and improve compressibility, but I’m also not sure any of this is 100% clear-cut.
It’s not that dropout provides some ensembling secret sauce; instead, neural nets are inherently ensembles, to a degree proportional to their level of overcompleteness. Dropout (like other regularizers) helps ensure they are ensembles of low-complexity sub-models, rather than ensembles of overfit, higher-complexity sub-models (see also: lottery tickets, pruning, grokking, double descent).
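To make the "ensemble of sub-models" reading concrete, here is a minimal sketch of my own, not taken from any of the linked papers. It uses a toy one-hidden-layer net whose weights (`W1`, `w2`) and input (`x`) are made up for illustration, enumerates every dropout mask over the hidden units, and checks that the probability-weighted average of the masked sub-networks matches the full network. The match is only exact because the readout here is linear and dropout is applied to a single layer; with nonlinearities after the dropped layer it holds only approximately.

```python
import itertools
import math

def hidden(x, W1):
    # ReLU hidden layer: one activation per row of W1.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]

def readout(h, w2, mask=None, keep_prob=1.0):
    # Linear readout. With a mask this is one "sub-network" under
    # inverted dropout: kept units are rescaled by 1 / keep_prob.
    if mask is None:
        mask = [1] * len(h)
    return sum(w * hi * m / keep_prob for w, hi, m in zip(w2, h, mask))

def ensemble_average(h, w2, keep_prob):
    # Probability-weighted average over all 2^n dropout masks.
    n = len(h)
    total = 0.0
    for mask in itertools.product([0, 1], repeat=n):
        kept = sum(mask)
        prob = keep_prob ** kept * (1 - keep_prob) ** (n - kept)
        total += prob * readout(h, w2, mask, keep_prob)
    return total

# Hypothetical toy weights and input, chosen only for illustration.
W1 = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0], [1.0, 1.0]]
w2 = [1.0, -2.0, 0.5, 0.75]
x = [1.0, 2.0]

h = hidden(x, W1)
full = readout(h, w2)                          # full network, no dropout
avg = ensemble_average(h, w2, keep_prob=0.5)   # average over sub-networks

print(full, avg)
```

Note that this only shows the averaging view; it says nothing about whether the individual sub-networks end up simpler or easier to prune, which is the part still under debate above.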
By the way, if you look at Filan et al.’s paper “Clusterability in Neural Networks”, there is a lot of variance in their results, but generally speaking they find that L1 regularization leads to slightly more clusterability than L2 or dropout.