Agreed that the existence of general-purpose heuristic-generators like relaxation is a strong argument for why we should expect to select for inner optimizers that look something like A*, contrary to my “gradient descent doesn’t select for inner search” post.
Recursive structure creates an even stronger bias toward things like A*, but only in recurrent neural architectures (so notably not in currently-popular transformer architectures, though it’s plausible that recurrent architectures will come back).
I maintain that the compression / compactness argument from “Risks from Learned Optimization” is wrong, at least in the current ML regime:
In general, evolved/trained/selected systems favor more compact policies/models/heuristics/algorithms/etc. In ML, for instance, the fewer parameters needed to implement the policy, the more parameters are free to vary, and therefore the more parameter-space-volume the policy takes up and the more likely it is to be found. (This is also the main argument for why overparameterized ML systems are able to generalize at all.)
I believe the standard explanation is that overparametrized ML finds generalizing models because gradient descent with weight decay finds policies that have low L2 norm, not low description length / Kolmogorov complexity. See Neel’s recent interpretability post for an example of weight decay slowly selecting a generalizable algorithm over (non-generalizable) memorization over the course of training.
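(Spelling out the mechanism as a minimal sketch, with learning rate $\eta$ and weight-decay coefficient $\lambda$: the update

$$\theta_{t+1} = \theta_t - \eta\,\big(\nabla L(\theta_t) + \lambda\,\theta_t\big)$$

is just gradient descent on the penalised objective

$$L(\theta) + \tfrac{\lambda}{2}\lVert\theta\rVert_2^2,$$

so the explicit pressure is toward weights with small L2 norm, not toward short description length.)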
I don’t understand the parameter-space-volume argument, even after a long back-and-forth with Vladimir Nesov here. If it were true, wouldn’t we expect to be able to distill models like GPT-3 down to 10-100x fewer parameters? In practice we see maybe 2x distillation before dramatic performance losses, meaning most of those parameters really are essential to the learned policy.
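For concreteness, by “distill” I mean the standard soft-target setup (student trained to match the teacher’s softened output distribution plus the hard labels). A minimal sketch, where the sizes, temperature, and mixing weight are placeholder choices rather than anything GPT-3-specific:

```python
# Hinton-style knowledge distillation loss: the student matches the teacher's
# temperature-softened distribution (soft targets) plus the true labels (hard targets).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL between softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random tensors standing in for real model outputs.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```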
Overall, though, this post updated me substantially towards expecting the emergence of inner A*-like algorithms, despite their computational overhead. Added it to the list of caveats in my post.
I believe the standard explanation is that overparametrized ML finds generalizing models because gradient descent with weight decay finds policies that have low L2 norm, not low description length / Kolmogorov complexity.
I have some math that hints that those may be equivalent-ish statements.
I don’t understand the parameter-space-volume argument, even after a long back-and-forth with Vladimir Nesov here. If it were true, wouldn’t we expect to be able to distill models like GPT-3 down to 10-100x fewer parameters?
Why would we expect a 10x distillation factor? Half the directions of the basin being completely flat already seems like a pretty big optimum to me.
Also, I’m not sure if you can always manage to align the free directions in parameter space with individual parameters, such that you can discard p parameters if you had p free directions.
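(One concrete example of the mismatch: the ReLU rescaling symmetry, scaling a unit’s incoming weights by $c$ and its outgoing weights by $1/c$, gives a completely flat direction through every optimum, but that direction mixes several weights at once; it doesn’t correspond to any single parameter you could just drop from the architecture.)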
Would love to see your math! If L2 norm and Kolmogorov provide roughly equivalent selection pressure, that’s definitely a crux for me.
There should be a post with some of it out soon-ish. Short summary:
You can show that at least for overparametrised neural networks, the eigenvalues of the Hessian of the loss function at optima, which determine the basin size within some approximation radius, are basically given by something like the number of independent, orthogonal features the network has, and how “big” these features are.
The fewer independent, mutually orthogonal features the network has, and the smaller they are, the broader the optimum will be. Size and orthogonality are given by the Hilbert space scalar product for functions here.
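To give a rough sense of where that comes from, here’s the shape of the calculation in the simplest case, square loss at an interpolating optimum (the general version needs more care): for

$$L(\theta) = \frac{1}{2N}\sum_{i=1}^{N}\big(f_\theta(x_i) - y_i\big)^2,$$

the term of the Hessian involving $\nabla^2_\theta f_\theta$ is weighted by the residuals $f_\theta(x_i) - y_i$ and so vanishes at a zero-loss optimum, leaving the Gauss–Newton term

$$\nabla^2 L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\nabla_\theta f_\theta(x_i)\,\nabla_\theta f_\theta(x_i)^\top,$$

i.e. the Gram matrix of the per-parameter “feature” functions $\partial_{\theta_j} f_\theta$ under the empirical inner product $\langle u, v\rangle = \frac{1}{N}\sum_i u(x_i)\,v(x_i)$. Features that are redundant or small give zero or small eigenvalues, i.e. flat directions and a broader basin.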
That sure sounds an awful lot like a kind of complexity measure to me. Not sure it’s Kolmogorov exactly, but it does seem like something related.
And while I haven’t formalised it yet, I think there’s quite a lot to suggest that the less information you pass around in the network, the fewer independent features you’ll tend to have. E.g., if you have 20 independent bits of input information, and you only pass 10 of them on to the deeper layers of the network, you’ll be much more likely to end up with fewer unique features than if you’d passed all of them on: you’re making the Hilbert space smaller.
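To make the counting concrete (toy version): real-valued functions of all 20 bits span a $2^{20}$-dimensional space, while functions that only depend on the 10 bits you passed on span a $2^{10}$-dimensional subspace,

$$\dim\{f:\{0,1\}^{20}\to\mathbb{R}\} = 2^{20} \quad\text{vs.}\quad \dim\{f:\{0,1\}^{10}\to\mathbb{R}\} = 2^{10},$$

so the space the deeper layers’ features can live in shrinks by a factor of $2^{10}\approx 1000$.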
So if you introduce a penalty on exchanging too much information between parts of the network, like, say, with L2 regularisation, you’d expect the optimiser to find solutions with fewer independent features (a shorter “description length”) and broader basins.
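If it helps, here’s a toy sketch of what an L2 connection cost between two parts of a network could look like; the architecture, names, and penalty coefficient are purely illustrative, not our actual experimental setup:

```python
# Toy "connection cost": an L2 penalty applied only to the weights that carry
# information between two modules, rather than to all parameters.
import torch
import torch.nn as nn

class TwoModuleNet(nn.Module):
    def __init__(self, d_in=20, d_hidden=32, d_out=1):
        super().__init__()
        self.module_a = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        # The bridge is the only path for information from module A to module B.
        self.bridge = nn.Linear(d_hidden, d_hidden)
        self.module_b = nn.Sequential(nn.ReLU(), nn.Linear(d_hidden, d_out))

    def forward(self, x):
        return self.module_b(self.bridge(self.module_a(x)))

def loss_with_connection_cost(model, x, y, lam=1e-3):
    task_loss = nn.functional.mse_loss(model(x), y)
    # Penalise only the bridge weights: the idea is that this limits how much
    # information gets exchanged, and hence how many independent features the
    # downstream module can build.
    connection_cost = lam * model.bridge.weight.pow(2).sum()
    return task_loss + connection_cost

# Toy usage.
model = TwoModuleNet()
x, y = torch.randn(16, 20), torch.randn(16, 1)
loss = loss_with_connection_cost(model, x, y)
loss.backward()
```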
Empirically, introducing “connection costs” does seem to lead to broader basins in our group’s experiments, IIRC. Also, there’s a bunch of bio papers on how connection costs lead to modularity, and our own experiments support the idea that modularity means broader basins. I’m not sure I’ve seen it implemented with L2 regularisation as the connection cost specifically, but my guess would be that it’d do the same thing.
(Our hope is actually that these orthogonalised features might prove to be a better fundamental unit of DL theory and interpretability than neurons, but we haven’t gotten to testing that yet)