Not strictly true. Simply by overparameterizing the ANN, you can emulate parallel gradient descent (global optimization). Stochastic techniques such as dropout, and tricks such as momentum, can also help tunnel through local optima. In addition, the typical optimization hurdles in these landscapes tend to be saddle points rather than local optima.
Modern SGD mechanisms are powerful global optimizers.
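To make the over-parameterization point concrete, here is a minimal sketch of the "parallel gradient descent" picture: many descent trajectories run side by side, and the best basin found is kept. The function names, the toy 1-D landscape, and all the hyperparameters are my own invention, not anything a real network literally does.

```python
import numpy as np

def parallel_descent(grad, loss, n_runs=64, lr=0.005, steps=3000, span=10.0):
    """Run many gradient-descent trajectories side by side and keep the
    best basin found. A crude cartoon of what extra parameters can buy,
    not a claim about what any particular network does internally."""
    rng = np.random.default_rng(0)
    xs = rng.uniform(-span, span, size=n_runs)   # many random starting points
    for _ in range(steps):
        xs -= lr * grad(xs)                      # all trajectories step at once
    return xs[np.argmin(loss(xs))]               # keep the deepest basin found

# Toy multimodal landscape: a parabola with ripples that create local optima.
loss = lambda x: x**2 + 10 * np.sin(3 * x)
grad = lambda x: 2 * x + 30 * np.cos(3 * x)
print(parallel_descent(grad, loss))  # should land in the global basin near x ≈ -0.5
```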
I mean, the fact that neural nets didn’t give us strong AI back in the 70s demonstrates that they are not doing anything close to Solomonoff induction.
It demonstrates nothing of the sort, but such a demonstration would be pointless anyway. Solomonoff induction is intractable and thus completely worthless in practice, so you absolutely don't want to be doing anything close to it in the first place.
They [modern SGD systems] are heuristic optimizers that have no guarantees of finding a global optimum. It’s strange to call them “powerful global optimizers”.
Not at all strange. As I pointed out in the post above, modern SGD functions as a global optimizer due to clever tricks such as over-parameterization, dropout, and momentum. These optimizers are also surprisingly powerful as measured by actual solution quality versus computational cost; indeed, they dominate most of the landscape. It is completely disingenuous to call the dominant solution anything other than powerful.
In practice, strong guarantees are not what matters. What actually matters is finding a solution that is good enough relative to its computational cost, where "good enough" is determined by some more complex business/economic utility metric and solution quality follows diminishing returns. For sufficiently complex problems, finding the actual global optimum is rarely even desirable: it sits far too far to the right on the asymptotic diminishing-returns curve.
What actually matters is finding a solution that is good enough relative to computational costs
That is true, but calling methods for finding such solutions "global optimizers" is still a misnomer. Adding some noise/randomness to your local optimization does not a global optimizer make.
Of course, stochastic local-ish optimizers can be very powerful and very useful. They are just not global.
Disagree, but this may now be a matter of terminology.
By powerful global optimizer, I mean an optimizer which can find global optima in roughly similar time/space to other techniques, and, more importantly, can find good-enough solutions that are very close to the true global optimum using much less time/space.
Adding some noise/randomness to your local optimization does not a global optimizer make.
Actually it does. As I pointed out above, adding the right kind of noise/randomness allows SGD to find the global optimum, given enough chances/time. This is trivially true: if you randomly search long enough and retain the best solution found so far, you are guaranteed to eventually find the optimal solution.
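That is exactly the textbook pure-random-search argument. A minimal sketch, using a toy 1-D objective of my own choosing:

```python
import math
import random

def random_search(loss, sample, iters=100_000):
    """Sample at random, retain the best so far. Given unbounded time this
    finds the global optimum with probability 1, which is precisely the
    sense in which injected randomness makes a search global."""
    best_x = sample()
    best = loss(best_x)
    for _ in range(iters):
        x = sample()
        cur = loss(x)
        if cur < best:
            best, best_x = cur, x
    return best_x, best

# Multimodal toy objective over [-10, 10]; its global minimum is ≈ -9.7.
loss = lambda x: x * x + 10 * math.sin(3 * x)
print(random_search(loss, lambda: random.uniform(-10, 10)))
```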
Thinking about it a bit, the "global" vs "local" property is not really a property of an optimizer so much as a combined property of the optimizer and the search space (aka the error landscape). For an entirely arbitrary search space, only exhaustive search and its equivalents will guarantee you the global optimum. For simple search spaces (e.g. quadratic optimization), easy "local" methods like hill-climbing will guarantee you the global optimum. If your solution is bounded, you always know how close to the theoretical best you are and can adjust the search accordingly (e.g. terminate early), etc.
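A small illustration of that joint property: the same local method run on two different landscapes (both toy functions of my own choosing), global on one and not on the other.

```python
import math

def descend(grad, x0, lr=0.001, steps=20_000):
    """Plain gradient descent: 'local' or 'global' only relative to the landscape."""
    x = float(x0)
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Convex landscape (x^2): the local method is a genuine global optimizer.
print(descend(lambda x: 2 * x, x0=7.0))                         # -> essentially 0
# Multimodal landscape (x^2 + 10*sin(3x)): the same method stalls in
# whichever basin it starts in, far from the global minimum near x ≈ -0.5.
print(descend(lambda x: 2 * x + 30 * math.cos(3 * x), x0=7.0))  # -> x ≈ 7.7
```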
Yes. In some cases, simple (convex) search spaces even permit exact non-iterative analytic solutions which are faster than any iterative approach like SGD (although, curiously, they are typically not dramatically faster: they still need to analyze each example once, and the number of full epochs/iterations in SGD-type approaches appears to grow sublinearly).
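A concrete sketch of the convex case: an exact one-shot solve via the normal equations versus SGD on the same least-squares objective. The data is synthetic and the sizes and learning rate are arbitrary choices of mine; the point is only that both routes must touch every example at least once.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # 1000 examples, 5 features
y = X @ rng.normal(size=5) + 0.01 * rng.normal(size=1000)

# Exact analytic solution to least squares: one pass over the data to form
# X^T X and X^T y, then a small linear solve. Non-iterative, but it still
# has to read every example once.
w_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Plain SGD on the same convex objective; a handful of epochs suffices here.
w = np.zeros(5)
for epoch in range(20):
    for i in rng.permutation(len(y)):
        w -= 0.01 * (X[i] @ w - y[i]) * X[i]

print(np.max(np.abs(w - w_exact)))             # both land near the same optimum
```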
The more advanced methods (natural gradient, Hessian-free, etc.) expend extra computation to analyze the local search space and then use that analysis to make an improved nonlocal jump. This always comes at extra cost, though, so these techniques tend to merely match SGD in terms of net optimization speed.
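A sketch of that trade-off, with a full Newton step standing in for the curvature-aware methods, on a toy 2-D quadratic of my own choosing:

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x - b^T x with deliberately bad conditioning.
A = np.array([[10.0, 0.0],
              [0.0, 0.1]])
b = np.array([1.0, 1.0])

# Curvature-aware jump: analyze the local landscape (here, solve against the
# Hessian A) and move nonlocally. One step reaches the optimum exactly, but
# each step costs a linear solve.
x0 = np.zeros(2)
x_newton = x0 - np.linalg.solve(A, A @ x0 - b)

# Gradient descent: each step is cheap, but after 100 of them the iterate is
# still well short of the optimum along the flat direction.
x_gd = np.zeros(2)
for _ in range(100):
    x_gd -= 0.1 * (A @ x_gd - b)

print(x_newton)  # [0.1, 10.0], the exact optimum
print(x_gd)      # first coordinate done, second still crawling toward 10
```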
Looking at it another way: when a researcher uses something like a convex optimization technique to solve a problem more quickly, what actually happened is that the combined human+machine system used a global optimizer to solve the problem, with more of the algorithm running in the human's brain than on the machine. ;)
They are heuristic optimizers that have no guarantees of finding a global optimum. It’s strange to call them “powerful global optimizers”.
I believe that was my point.