AutoBound on neural networks can achieve OOMs lower training loss

It seems gradient descent methods haven't been making use of the relevant mathematical bounds so far. Google has released AutoBound as an open-source library.
Here is what I consider the money shot of the article (note that it's a log plot):
Performance of SafeRate when used to train a single-hidden-layer neural network on a subset of the MNIST dataset, in the full-batch setting.
Hopefully they are just overfitting on MNIST; otherwise this pattern-matches to a huge advance. Their repo implies that with float64 this scales to larger neural networks. And LLMs do seem to reliably gain new capabilities as loss goes down.
What do you think?
Here are the related technical details, quoted from the article:
Optimizers that use upper bounds in this way are called majorization-minimization (MM) optimizers. Applied to one-dimensional logistic regression, AutoBound rederives an MM optimizer first published in 2009. Applied to more complex problems, AutoBound derives novel MM optimizers that would be difficult to derive by hand.
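To make the MM idea concrete, here is a minimal sketch of a majorize-minimize step for one-dimensional logistic regression, using the textbook curvature bound sigmoid'(z) ≤ 1/4 to build a quadratic upper bound on the loss and then jumping to its minimizer. This is my own illustration of the general MM pattern, not AutoBound's code and not necessarily the exact 2009 optimizer the article refers to:

```python
# Minimal majorization-minimization (MM) sketch for 1-D logistic regression.
# Not AutoBound's derivation or API; just the classic hand-derived quadratic
# majorizer that illustrates the pattern the article describes.
import numpy as np

def mm_logistic_1d(x, y, num_steps=100):
    """Fit a single weight w for P(y=1|x) = sigmoid(w*x) by MM.

    x: (n,) features, y: (n,) labels in {0, 1}.
    """
    w = 0.0
    # Curvature bound: sigmoid'(z) <= 1/4 everywhere, so the second derivative
    # of the negative log-likelihood is at most B = 0.25 * sum(x_i^2).
    B = 0.25 * np.sum(x ** 2)
    for _ in range(num_steps):
        p = 1.0 / (1.0 + np.exp(-w * x))   # predicted probabilities
        grad = np.sum((p - y) * x)          # gradient of the NLL at w
        # Quadratic upper bound: L(v) <= L(w) + grad*(v - w) + 0.5*B*(v - w)^2.
        # Minimizing the bound in closed form gives the MM update below.
        w = w - grad / B
    return w

# Example: w_hat = mm_logistic_1d(np.array([1., 2., -1.]), np.array([1, 1, 0]))
```

Each step minimizes a global quadratic upper bound on the negative log-likelihood, so the loss is non-increasing by construction; the point of AutoBound is deriving bounds of this kind automatically for much more complicated losses.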
We can use a similar idea to take an existing optimizer such as Adam and convert it to a hyperparameter-free optimizer that is guaranteed to monotonically reduce the loss (in the full-batch setting). The resulting optimizer uses the same update direction as the original optimizer, but modifies the learning rate by minimizing a one-dimensional quadratic upper bound derived by AutoBound. We refer to the resulting meta-optimizer as SafeRate.
Using SafeRate, we can create more robust variants of existing optimizers, at the cost of a single additional forward pass that increases the wall time for each step by a small factor (about 2x slower in the example above).
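For concreteness, here is how I picture a SafeRate-style step, reconstructed from the description above. This is my own sketch, not code from the paper or the repo; in particular, the bound coefficients (a, b) are passed in by hand here, whereas in the real system the one-dimensional quadratic upper bound along the update direction would come from AutoBound:

```python
# Sketch of a SafeRate-style meta-optimizer step: keep the base optimizer's
# update direction, and choose the learning rate by minimizing a 1-D quadratic
# upper bound on the loss. My reconstruction, not the paper's implementation.
import numpy as np

def saferate_step(params, direction, a, b, eta_max):
    """One SafeRate-style step.

    Assumes a quadratic upper bound valid for eta in [0, eta_max]:
        loss(params - eta * direction) <= loss(params) + a*eta + b*eta**2
    """
    # Candidate step sizes: the trust-region endpoints, plus the unconstrained
    # minimizer of the quadratic if it falls inside the trust region.
    candidates = [0.0, eta_max]
    if b > 0:
        candidates.append(float(np.clip(-a / (2.0 * b), 0.0, eta_max)))
    gap = lambda eta: a * eta + b * eta ** 2  # bound minus the current loss
    eta = min(candidates, key=gap)
    # eta minimizes an upper bound that equals the current loss at eta = 0,
    # so the true (full-batch) loss cannot increase at this step.
    return params - eta * direction

# Toy usage: for loss(w) = 0.5 * ||w||^2 and direction = gradient = w,
# the quadratic "bound" is exact with a = -||w||^2 and b = 0.5 * ||w||^2,
# and the chosen eta = 1 jumps straight to the minimum.
w = np.array([3.0, -4.0])
g = w
a, b = -float(g @ g), 0.5 * float(g @ g)
print(saferate_step(w, g, a, b, eta_max=2.0))  # -> [0., 0.]
```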
This seems novel for neural network training, or am I missing something that Bayesian neural net people have already been doing?