What actually matters is finding a solution that is good enough relative to computational costs
That is true, but calling methods to find such solutions “global optimizers” is still a misnaming. Adding some noise/randomness to your local optimization does not a global optimizer make.
Of course, stochastic local-ish optimizers can be very powerful and very useful. They are just not global.
Disagree, but this may be now on the level of terminology.
By powerful global optimizer, I mean an optimizer which can find global optimums roughly using similar time/space to other techniques, and more importantly can find good enough solutions that are very close to the true global optimum but using much less time/space.
Adding some noise/randomness to your local optimization does not a global optimizer make.
Actually it does. As I pointed out above, adding the right kind of noise/randomness allows SGD to find the global optimum (given enough chances/time). This is just trivially true—if you randomly search long enough and retain the best solution found so far, you are guaranteed to eventually find the optimal solution.
Thinking about it a bit, the “global” vs “local” property is not really a property of an optimizer, as it is a combined property of the optimizer and the search space (aka the error landscape). For an entirely arbitrary search space, only the exhaustive search and equivalents will guarantee you the global optimum. For simple search spaces (e.g. quadratic optimization) easy “local” things like hill-climbing will guarantee you the global optimum. If your solution is bounded, you always know how close to the theoretical best you are and can adjust the search accordingly (e.g. terminate early), etc.
Yes. In some cases simple (convex) search spaces even permit exact non-iterative analytic solutions which are faster than any iterative approach like SGD (although curiously, they are typically not dramatically faster—as they still need to analyze each example once, and the number of full epochs/iterations in SGD type approaches appears to be sublinear. )
The more advanced methods—natural gradient, hessian free, etc—expend extra computation to analyze the local search space and then use this analysis to make an improved nonlocal jump. This always has extra cost though, so these techniques just end up tending to match SGD in terms of net optimization speed.
Looking at it another way, when a researcher uses something like a convex optimization technique to solve a problem more quickly—what actually happened is the combined human+machine system used a global optimizer to solve the problem, with more of the algorithm running in the human’s brain than on the machine. ;)
That is true, but calling methods to find such solutions “global optimizers” is still a misnaming. Adding some noise/randomness to your local optimization does not a global optimizer make.
Of course, stochastic local-ish optimizers can be very powerful and very useful. They are just not global.
Disagree, but this may be now on the level of terminology.
By powerful global optimizer, I mean an optimizer which can find global optimums roughly using similar time/space to other techniques, and more importantly can find good enough solutions that are very close to the true global optimum but using much less time/space.
Actually it does. As I pointed out above, adding the right kind of noise/randomness allows SGD to find the global optimum (given enough chances/time). This is just trivially true—if you randomly search long enough and retain the best solution found so far, you are guaranteed to eventually find the optimal solution.
Thinking about it a bit, the “global” vs “local” property is not really a property of an optimizer, as it is a combined property of the optimizer and the search space (aka the error landscape). For an entirely arbitrary search space, only the exhaustive search and equivalents will guarantee you the global optimum. For simple search spaces (e.g. quadratic optimization) easy “local” things like hill-climbing will guarantee you the global optimum. If your solution is bounded, you always know how close to the theoretical best you are and can adjust the search accordingly (e.g. terminate early), etc.
Yes. In some cases simple (convex) search spaces even permit exact non-iterative analytic solutions which are faster than any iterative approach like SGD (although curiously, they are typically not dramatically faster—as they still need to analyze each example once, and the number of full epochs/iterations in SGD type approaches appears to be sublinear. )
The more advanced methods—natural gradient, hessian free, etc—expend extra computation to analyze the local search space and then use this analysis to make an improved nonlocal jump. This always has extra cost though, so these techniques just end up tending to match SGD in terms of net optimization speed.
Looking at it another way, when a researcher uses something like a convex optimization technique to solve a problem more quickly—what actually happened is the combined human+machine system used a global optimizer to solve the problem, with more of the algorithm running in the human’s brain than on the machine. ;)