Jobst Heitzig comments on How To Think About Overparameterized Models

Jobst Heitzig 6 Jul 2023 22:14 UTC
1 point
roughly speaking, we gradient-descend our way to whatever point on the perfect-prediction surface is closest to our initial values.
I believe this is not correct as long as “gradient-descend” means some standard version of gradient descent, because those methods are all local, can follow highly nonlinear paths, and do not remember the initial value in order to stay close to it.
But maybe we can design a local search strategy similar to gradient descent that does try to stay close to the initial point x0? E.g., at x, take a small step in a direction that has the minimal scalar product with x − x0 among those directions that make an angle of at most alpha with the current gradient, where alpha > 0 is a hyperparameter. One might call this “stochastic cone descent” if it does not yet have a name.
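
To make the proposed step rule concrete, here is a rough sketch in plain Python. It rests on assumptions that go beyond the comment: "the current gradient" is read as the descent direction (the negative gradient), directions are unit vectors, and the function name `cone_descent_direction` and the helpers `dot` and `unit` are hypothetical. Under that reading, the minimizing direction is the one pointing straight back at x0 when that direction lies inside the cone, and otherwise the point on the cone's boundary tilted toward x0.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def unit(a):
    """Return a / |a|, or None if a is (numerically) the zero vector."""
    n = math.sqrt(dot(a, a))
    return [p / n for p in a] if n > 1e-12 else None

def cone_descent_direction(x, x0, grad, alpha):
    """Unit direction d minimizing d . (x - x0) among all unit vectors
    within angle alpha of the descent direction -grad (an assumed reading
    of the rule; alpha is in radians, 0 < alpha < pi)."""
    g = unit([-p for p in grad])          # descent direction
    if g is None:
        return None                       # zero gradient: no direction defined
    v = unit([a - b for a, b in zip(x0, x)])  # direction back toward x0
    if v is None:
        return g                          # still at x0: plain descent step
    c = dot(v, g)                         # cosine of angle between v and g
    if c >= math.cos(alpha):
        return v                          # v is already inside the cone
    # Otherwise the minimizer lies on the cone boundary, in the plane of g and v.
    w = unit([vi - c * gi for vi, gi in zip(v, g)])
    if w is None:
        return g                          # v is opposite to g: fall back to g (a simplification)
    return [math.cos(alpha) * gi + math.sin(alpha) * wi for gi, wi in zip(g, w)]

# One small step from x with step size eta (both hypothetical values):
x0, x, grad, eta = [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], 0.01
d = cone_descent_direction(x, x0, grad, math.pi / 4)
x_next = [xi + eta * di for xi, di in zip(x, d)]
```

For alpha below pi/2 every direction in the cone still decreases the loss to first order, so alpha trades off descent speed against staying near x0. Nothing in this sketch is stochastic; it would only become so if `grad` were a minibatch estimate.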