The quality of a neural network comes from its size, shape, and training data, but not from the training function, which is always simple gradient descent.
Only if you consider modern variants of batch Adam with momentum, regularization, etc. to be ‘simple gradient descent’.
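Roughly, here is what a single Adam step looks like (a minimal sketch with the usual hyperparameter names lr, beta1, beta2, eps; illustrative only, not any particular library's implementation):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum plus per-parameter adaptive scaling,
    already well beyond plain gradient descent (w -= lr * grad).
    t counts from 1."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad**2         # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**t)                    # bias corrections
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive per-parameter step
    return w, m, v
```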
Regardless, SGD techniques are reasonable approximations to Bayesian updating under numerous statistical limiting assumptions, which fully explains why they work when they do. (And the specific limiting assumptions in that approximation sufficiently explain the various scenarios when/where SGD notoriously fails, e.g. handling non-unit-variance distributions.)
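As a toy illustration of that non-unit-variance failure mode (my own sketch, not anything from the thread): plain SGD on a least-squares problem with badly mismatched feature scales stalls at the step size that keeps it stable, while the same code on standardized inputs converges fine.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Two features with wildly different variances (non-unit-variance inputs).
X = rng.normal(size=(n, 2)) * np.array([1.0, 100.0])
y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=n)

def sgd(X, y, lr, steps=2000):
    w = np.zeros(2)
    for t in range(steps):
        i = rng.integers(n)
        g = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5*(x.w - y)^2
        w -= lr * g
    return w

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

# Raw features: the step size small enough to stay stable in the high-variance
# direction is far too small for the low-variance direction, so the loss stalls.
print("raw loss:         ", mse(X, y, sgd(X, y, lr=1e-5)))
# Standardized features: the same plain SGD converges without trouble.
Xs = X / X.std(axis=0)
print("standardized loss:", mse(Xs, y, sgd(Xs, y, lr=1e-2)))
```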
Most of the other possibilities (higher-order techniques) trade off computational efficiency for convergence speed or stability, and it just happens that for many economically important workloads any convergence benefits of the more complex methods generally aren’t worth the extra compute cost; it’s better to spend that compute on more training or a larger model instead.
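A toy sketch of that tradeoff on a quadratic (again my own illustration, with made-up problem sizes): cheap first-order steps crawl on an ill-conditioned problem, while a single Newton step solves it exactly but costs O(n·d² + d³) to form and factor the Hessian.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 200
A = rng.normal(size=(n, d)) * rng.uniform(0.1, 10.0, size=d)  # ill-conditioned columns
b = rng.normal(size=n)

loss = lambda w: 0.5 * np.mean((A @ w - b) ** 2)

# First-order: cheap O(n*d) steps, but many of them when the problem is ill-conditioned.
w = np.zeros(d)
L = np.linalg.norm(A, 2) ** 2 / n     # Lipschitz constant of the gradient
for t in range(500):
    g = A.T @ (A @ w - b) / n
    w -= g / L

# Second-order: one Newton step minimizes the quadratic exactly,
# at O(n*d^2 + d^3) cost for the Hessian and the solve.
H = A.T @ A / n
w_newton = np.linalg.solve(H, A.T @ b / n)

print(f"500 gradient steps: {loss(w):.4f}   one Newton step: {loss(w_newton):.4f}")
```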
I suspect that will eventually change, but only when/if we have non-trivial advances in the relevant efficient GPU codes.
Here’s a good related Reddit thread on proximal-point-based alternatives to gradient methods.