I only briefly skimmed this post, so I don’t know if you cover this, but I wonder what you make of the various lines of research indicating that SGD on deep neural networks already implements a minimal-depth inductive bias? E.g.:
On the Implicit Bias Towards Minimal Depth of Deep Neural Networks
When you use a network that’s deeper than required to solve the problem, the trained network seems to mostly just ignore the additional depth.
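One crude way to probe this directly (my own sketch, not something from the paper; the `Block` module and `branch_contributions` helper below are hypothetical names) is to measure how much each residual branch of a trained network actually moves its input; branches the network has learned to ignore should contribute very little:

```python
# Minimal diagnostic sketch: for each residual block, measure how large the
# branch's output is relative to its input. Branches with tiny ratios are
# effectively pass-through, i.e. the network is "ignoring" that depth.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x):
        return x + self.branch(x)

def branch_contributions(blocks, x):
    """Return ||branch(x)|| / ||x|| for each block, applied sequentially."""
    ratios = []
    for block in blocks:
        delta = block.branch(x)
        ratios.append((delta.norm(dim=-1) / x.norm(dim=-1)).mean().item())
        x = x + delta
    return ratios

# Hypothetical usage: `blocks` would come from a trained, overly deep network.
blocks = nn.ModuleList([Block(32) for _ in range(8)])
x = torch.randn(128, 32)
print(branch_contributions(blocks, x))
```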
The Low-Rank Simplicity Bias in Deep Networks
I know they call it a “simplicity bias”, but they don’t mean anything like minimum-message-length simplicity. They actually mean function rank, so, e.g., the identity function, being full rank, would count as maximally “complex” under their notion of simplicity.
This becomes a sort of minimal-depth inductive bias because circuits that sequentially multiply more matrices suffer worse rank collapse, to the point where they can no longer implement high-rank functions.
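To make the rank-collapse point concrete, here is a minimal sketch (mine, not from the paper) of how the effective rank of a product of random full-rank matrices tends to fall as more factors are multiplied in:

```python
# Rank collapse in products of random matrices: each factor is full rank,
# but the effective rank of W_L @ ... @ W_1 drops as the number of factors grows.
import numpy as np

def effective_rank(M: np.ndarray) -> float:
    """Entropy-based effective rank: exp of the entropy of the normalized singular values."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

rng = np.random.default_rng(0)
d = 64
for depth in [1, 2, 4, 8, 16]:
    prod = np.eye(d)
    for _ in range(depth):
        # Scale by 1/sqrt(d) so the product's overall norm stays comparable across depths.
        prod = (rng.standard_normal((d, d)) / np.sqrt(d)) @ prod
    print(depth, round(effective_rank(prod), 2))
```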
Residual Networks Behave Like Ensembles of Relatively Shallow Networks
This paper argues that, at least for residual image networks, most of the decision-relevant computation is implemented through pathways that are significantly shallower than the full depth of the network.
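The combinatorics behind that “ensemble of shallow paths” picture is easy to check directly; here is a small sketch (mine; the 54-block depth is just an illustrative choice):

```python
# An n-block residual network, y = (I + f_n) applied after ... after (I + f_1),
# expands into 2^n paths, one per subset of blocks. The number of blocks a path
# passes through is Binomial(n, 1/2), so the bulk of paths are far shallower
# than the full depth n.
from math import comb

n_blocks = 54  # depth chosen purely for illustration
total_paths = 2 ** n_blocks
for length in [0, 10, 20, 27, 40, 54]:
    frac = comb(n_blocks, length) / total_paths
    print(f"paths of length {length:2d}: {frac:.2e} of all paths")
```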
It seems like investigating the mechanisms behind such an inductive bias, and intervening on them, would be the most straightforward way to tune a network’s degree of speed bias.
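Purely as a hypothetical illustration (not something proposed in the post or the papers above), one crude knob would be a scalar gate on each residual branch with an L1 penalty, so training can be pushed toward using fewer, or more, effective layers:

```python
# Hypothetical sketch of a depth/speed-bias knob: gate each residual branch
# and add an L1 penalty on the gates, so the optimizer is nudged toward
# routing computation through fewer blocks.
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.gate = nn.Parameter(torch.ones(1))  # how much this block contributes

    def forward(self, x):
        return x + self.gate * self.branch(x)

def depth_penalty(blocks, strength: float = 1e-3):
    """L1 penalty on the gates; larger `strength` biases toward shallower computation."""
    return strength * sum(b.gate.abs().sum() for b in blocks)

# Hypothetical usage inside a training loop:
# loss = task_loss(model(x), y) + depth_penalty(model.blocks)
```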