There should be a post with some of it out soon-ish. Short summary:
You can show that, at least for overparametrised neural networks, the eigenvalues of the Hessian of the loss function at an optimum, which determine the basin size within some approximation radius, are basically given by something like the number of independent, mutually orthogonal features the network has, and how “big” these features are.
The fewer independent, mutually orthogonal features the network has, and the smaller they are, the broader the optimum will be. “Size” and orthogonality here are measured by the Hilbert space scalar product for functions.
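To make that concrete, here’s a minimal PyTorch sketch of one way to read the claim (my own toy reading, not necessarily the exact setup of the post): for MSE loss at an interpolating optimum, the residual part of the Hessian vanishes, so the Hessian is just the Gram matrix of the parameter-gradient “features” under the empirical scalar product. The number of nonzero eigenvalues then counts the independent features, and their sizes say how “big” those features are.

```python
import torch

torch.manual_seed(0)

# Tiny overparametrised one-hidden-layer net, fit a few points to ~zero MSE loss.
n, width = 8, 64
X = torch.linspace(-1.0, 1.0, n).unsqueeze(1)
y = torch.sin(3.0 * X)

def unpack(theta):
    """Split a flat parameter vector into (W, b, c)."""
    W = theta[:width].reshape(width, 1)
    b = theta[width:2 * width]
    c = theta[2 * width:].reshape(1, width)
    return W, b, c

def f(theta, x):
    """Network output for inputs x, given flat parameters theta."""
    W, b, c = unpack(theta)
    return torch.tanh(x @ W.T + b) @ c.T

def mse(theta):
    return 0.5 * ((f(theta, X) - y) ** 2).mean()

theta = torch.randn(3 * width, requires_grad=True)
opt = torch.optim.Adam([theta], lr=1e-2)
for _ in range(5000):
    opt.zero_grad()
    mse(theta).backward()
    opt.step()
print(f"final loss: {mse(theta).item():.2e}")  # should be close to zero

# Full Hessian of the loss at the (near-)optimum.
H = torch.autograd.functional.hessian(mse, theta.detach())

# Gram matrix of the "features" phi_a(x) = d f(x; theta) / d theta_a under the
# empirical scalar product <phi_a, phi_b> = (1/n) sum_i phi_a(x_i) phi_b(x_i).
J = torch.autograd.functional.jacobian(lambda t: f(t, X).squeeze(1), theta.detach())  # (n, P)
G = J.T @ J / n

h_eigs = torch.linalg.eigvalsh(H)
g_eigs = torch.linalg.eigvalsh(G)
print("top Hessian eigenvalues:", h_eigs[-5:].tolist())
print("top Gram eigenvalues:   ", g_eigs[-5:].tolist())
# At an interpolating optimum the residual term of the Hessian vanishes, so H ~ G:
# the curvature spectrum (basin shape) is set by how many independent features
# there are and how big they are under the Hilbert-space scalar product.
```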
That sure sounds an awful lot like a kind of complexity measure to me. Not sure it’s Kolmogorov exactly, but it does seem like something related.
And while I haven’t formalised it yet, I think there’s quite a lot to suggest that the less information you pass around in the network, the fewer independent features you’ll tend to have. E.g., if you have 20 independent bits of input information and you only pass 10 of them on to the deeper layers of the network, you’ll be much more likely to end up with fewer unique features than if you’d passed on all 20. Because you’re making the Hilbert space smaller.
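As a toy illustration of why dropping input bits shrinks the Hilbert space (my own construction, not one of our experiments): take all 8-bit inputs, and draw random downstream “features” that are allowed to see either the full input or only the first 4 bits. The rank of their Gram matrix, i.e. the number of independent features, is capped by the bottleneck:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inputs: all 2^8 = 256 bit-strings. "Features": random scalar functions of either
# the full input or only the first 4 bits (a bottleneck). We count how many of
# them are linearly independent under the scalar product <f, g> = mean_x f(x) g(x).
n_bits, kept_bits, n_features = 8, 4, 64
inputs = np.array(np.meshgrid(*[[0, 1]] * n_bits)).reshape(n_bits, -1).T  # (256, 8)

def random_features(bits):
    """Each feature is a random lookup table over the visible bit columns."""
    idx = bits @ (1 << np.arange(bits.shape[1]))      # integer code of the visible bits
    tables = rng.standard_normal((n_features, idx.max() + 1))
    return tables[:, idx]                             # (n_features, 256)

full = random_features(inputs)                        # features see all 8 bits
bottlenecked = random_features(inputs[:, :kept_bits]) # features see only 4 bits

for name, F in [("full input", full), ("4-bit bottleneck", bottlenecked)]:
    gram = F @ F.T / F.shape[1]                       # Hilbert-space scalar products
    rank = np.linalg.matrix_rank(gram)
    print(f"{name}: rank of feature Gram matrix = {rank} / {n_features}")
# Expected: all 64 features independent with the full input, at most 2^4 = 16
# independent features behind the bottleneck.
```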
So if you introduce a penalty on exchanging too much information between parts of the network, like, say, with L2 regularisation, you’d expect the optimiser to find solutions with fewer independent features (a shorter “description length”), and hence broader basins.
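This prediction is cheap to sanity-check on the same toy net as in the first sketch: train it with and without weight decay (the bluntest L2 connection cost, applied to every weight rather than between specific parts of the network) and compare the curvature of the data loss at whatever solution each run finds. This is just a sketch of how one might check it, not a result:

```python
import torch

torch.manual_seed(0)

# Same tiny net and data as the first sketch, trained with and without an L2
# penalty (weight decay), comparing the top eigenvalue of the *data-loss* Hessian
# at the solution each run finds (smaller top eigenvalue = broader basin).
n, width = 8, 64
X = torch.linspace(-1.0, 1.0, n).unsqueeze(1)
y = torch.sin(3.0 * X)

def f(theta, x):
    W = theta[:width].reshape(width, 1)
    b = theta[width:2 * width]
    c = theta[2 * width:].reshape(1, width)
    return torch.tanh(x @ W.T + b) @ c.T

def data_loss(theta):
    return 0.5 * ((f(theta, X) - y) ** 2).mean()

for wd in (0.0, 1e-2):
    theta = torch.randn(3 * width, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=1e-2, weight_decay=wd)
    for _ in range(5000):
        opt.zero_grad()
        data_loss(theta).backward()
        opt.step()
    H = torch.autograd.functional.hessian(data_loss, theta.detach())
    top = torch.linalg.eigvalsh(H)[-1].item()
    print(f"weight decay {wd}: data loss {data_loss(theta).item():.2e}, "
          f"top Hessian eigenvalue {top:.3f}")
# The hypothesis predicts a smaller top eigenvalue (broader basin) in the
# weight-decayed run; this toy doesn't prove anything, it just makes the claim checkable.
```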
Empirically, introducing “connection costs” does seem to lead to broader basins in our group’s experiments, IIRC. Also, there are a bunch of bio papers on how connection costs lead to modularity, and our own experiments support the idea that modularity means broader basins. I’m not sure I’ve seen it implemented with L2 regularisation as the connection cost specifically, but my guess would be that it’d do the same thing.
(Our hope is actually that these orthogonalised features might prove to be a better fundamental unit of DL theory and interpretability than neurons, but we haven’t gotten to testing that yet)
Would love to see your math! If L2 norm and Kolmogorov provide roughly equivalent selection pressure that’s definitely a crux for me.