One way to think about gradient descent is that if there’s an N-dimensional parameter space on which you build a grid of M^N points and do a grid search for the minimum, well, you can accomplish the same selection task with O(M) steps of gradient descent (at least if there’s no local minimum). So should we say that M steps of gradient descent give you O(N log M) bits of optimization? Or something like that? I’m not sure.
I like that intuition overall, but there’s a sense in which adaptive search gives far more resolution than grid search, so the analysis seems wrong: if I use gradient descent, my eventual accuracy is set by the step size near the end, which gives far more precise answers than “only” checking a grid of M^N equally spaced points.
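The resolution point is easy to see numerically in one dimension. Here's a minimal sketch, with the same budget (M evaluations) spent two ways: a grid of M equally spaced points on [0, 1], versus M gradient steps on a quadratic. The target location, step size, and starting point are illustrative assumptions, not anything from the discussion above.

```python
import math

TARGET = 0.337  # assumed minimum location of the toy objective
M = 16          # shared budget: grid points == gradient steps

def f(x):
    # Toy 1-D objective with a single global minimum at TARGET
    return (x - TARGET) ** 2

# Grid search: M equally spaced points on [0, 1]
grid = [i / (M - 1) for i in range(M)]
grid_best = min(grid, key=f)
grid_err = abs(grid_best - TARGET)

# Gradient descent: M steps from the middle of the interval
x, lr = 0.5, 0.3
for _ in range(M):
    x -= lr * 2 * (x - TARGET)  # gradient of (x - t)^2 is 2(x - t)
gd_err = abs(x - TARGET)

# "Bits of optimization" as log2 of how much each search
# narrows down the unit interval
grid_bits = math.log2(1 / grid_err)
gd_bits = math.log2(1 / gd_err)
print(f"grid search error: {grid_err:.2e} (~{grid_bits:.1f} bits)")
print(f"grad descent error: {gd_err:.2e} (~{gd_bits:.1f} bits)")
```

On this (very friendly, convex) objective the gradient-descent error shrinks geometrically with the number of steps, so its final accuracy is far below the grid spacing of roughly 1/M, which is the second comment's point: the adaptive search ends up with many more bits than log2(M) per dimension.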