Architecture-aware optimisation: train ImageNet and more without hyperparameters

A deep learning system is composed of lots of interrelated components: architecture, data, loss function and gradients. There is a structure in the way these components interact—however, the most popular optimisers (e.g. Adam and SGD) do not utilise this information. This means there are leftover degrees of freedom in the optimisation process—which we currently have to take care of via manually tuning their hyperparameters (most importantly, the learning rate). If we could characterise these interactions perfectly, we could remove all degrees of freedom, and thus remove the need for hyperparameters.
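To make the "leftover degree of freedom" concrete, here is a minimal sketch (a toy example, not taken from the paper) of plain gradient descent on a one-dimensional quadratic. The function name `gradient_descent` and the curvature values are illustrative assumptions. The point is only that whether a given learning rate works depends on the curvature of the problem, which the update rule itself knows nothing about.

```python
def gradient_descent(curvature, learning_rate, steps=20, w0=1.0):
    """Minimise f(w) = 0.5 * curvature * w**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = curvature * w          # f'(w)
        w = w - learning_rate * grad  # the only free choice is learning_rate
    return w


# The same learning rate behaves very differently once the curvature changes.
for curvature in (1.0, 10.0):
    for learning_rate in (0.01, 0.1, 0.25):
        w_final = gradient_descent(curvature, learning_rate)
        print(f"curvature={curvature:5.1f}  lr={learning_rate:4.2f}  "
              f"|w| after 20 steps = {abs(w_final):.3e}")
```

In this toy setting each step multiplies `w` by `1 - learning_rate * curvature`, so a learning rate of 0.25 converges when the curvature is 1 but diverges when it is 10. Tuning the learning rate by hand is a way of supplying that missing information.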

Second-order methods characterise the sensitivity of the objective to weight perturbations using implicit architectural information via the Hessian, and remove degrees of freedom that way. However, such methods can be computationally intensive and thus not practical for large models.
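As a rough illustration of this trade-off, the sketch below (again a toy example, not from the paper) applies a single Newton step to a two-dimensional quadratic. The helper names `grad`, `solve_2x2`, `newton_step` and `hessian_entries`, and the 25-million-weight figure, are hypothetical choices for illustration only.

```python
def grad(H, b, w):
    """Gradient of f(w) = 0.5 * w^T H w - b^T w for a 2x2 symmetric H."""
    return [H[0][0] * w[0] + H[0][1] * w[1] - b[0],
            H[1][0] * w[0] + H[1][1] * w[1] - b[1]]


def solve_2x2(H, g):
    """Solve H x = g for a 2x2 matrix using the explicit inverse."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
            (-H[1][0] * g[0] + H[0][0] * g[1]) / det]


def newton_step(H, b, w):
    """One Newton step, w - H^{-1} grad. No learning rate appears."""
    g = grad(H, b, w)
    step = solve_2x2(H, g)
    return [w[0] - step[0], w[1] - step[1]]


H = [[100.0, 0.0], [0.0, 1.0]]  # badly conditioned curvature
b = [1.0, 1.0]
w = newton_step(H, b, [5.0, -3.0])
print(w, grad(H, b, w))  # the gradient is numerically zero after one step


def hessian_entries(num_weights):
    """Number of entries in a dense Hessian: it grows quadratically."""
    return num_weights ** 2


print(hessian_entries(25_000_000))  # hypothetical 25M-weight model
```

On a quadratic, the Hessian supplies exactly the curvature information that the learning rate otherwise has to encode, so one step reaches the minimum. The cost is that a dense Hessian has one entry per pair of weights, which is what makes this approach impractical at scale.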

I worked with Jeremy Bernstein on leveraging explicit architectural information to produce a new first-order optimisation algorithm: Automatic Gradient Descent (AGD). With computational complexity no greater than SGD, AGD trained all architectures and datasets we threw at it without needing any hyperparameters: from a 2-layer fully connected network (FCN) on CIFAR-10 to ResNet50 on ImageNet. Where tested, AGD achieved test accuracy comparable to tuned Adam and SGD.

Anyone interested in the derivation, PyTorch code, or experiments may find the following links, or the summary figure below, useful.

  • Here is a link to a blog post I wrote summarising the paper.

  • Here is a link to the paper.

  • Here is a link to the official GitHub.

  • Here is a link to an experimental GitHub where we test AGD on systems not yet in the paper (including language models).

Solid lines show train accuracy and dotted lines show test accuracy. Left: In contrast to our method, Adam and SGD with default hyperparameters perform poorly on a deep fully connected network (FCN) on CIFAR-10. Middle: A learning rate grid search for Adam and SGD. Our optimiser performs about as well as fully-tuned Adam and SGD. Right: AGD trains ImageNet to a respectable test accuracy.

Hopefully, the ideas in the paper will form the basis of a more complete understanding of optimisation in neural networks. As discussed in the paper, a few applications still need to be fully fleshed out. The derivation relies on an architectural perturbation bound (a bound on the sensitivity of the network function to changes in the weights) that is based on a fully connected network with linear activations and no bias terms. Empirically, however, it works extremely well. In keeping with the derivation, our experiments did not use bias terms or affine parameters.

However, the version of AGD in the experimental GitHub supports 1D parameters like bias terms and affine parameters (implemented in the most obvious way, although requiring further theoretical justification), and preliminary experiments indicate good performance. Preliminary experiments on GPT2-scale language models on OpenWebText2 are also promising.
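To give a feel for what a perturbation bound of this kind says, here is a small numerical sketch for a linear network without biases. It checks a generic textbook-style bound that follows from expanding the product of perturbed weight matrices, stated here with Frobenius norms for convenience. It is an assumption-laden illustration and is not claimed to be the exact bound used in the paper. The layer widths, scales and helper names are all hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def forward(weights, x):
    """Linear network without biases: f(x) = W_L ... W_1 x."""
    for W in weights:
        x = matvec(W, x)
    return x


def frobenius(W):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(w * w for row in W for w in row))


def l2(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))


def random_matrix(rows, cols, scale, rng):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
widths = [8, 16, 16, 4]  # hypothetical layer widths
weights = [random_matrix(widths[i + 1], widths[i], 0.3, rng) for i in range(3)]
deltas = [random_matrix(widths[i + 1], widths[i], 0.03, rng) for i in range(3)]
perturbed = [[[w + d for w, d in zip(row_w, row_d)] for row_w, row_d in zip(W, D)]
             for W, D in zip(weights, deltas)]

x = [rng.gauss(0.0, 1.0) for _ in range(widths[0])]
change = l2([a - b for a, b in zip(forward(perturbed, x), forward(weights, x))])

# Bound: ||x|| * ( prod(||W_l|| + ||dW_l||) - prod(||W_l||) )
bound = l2(x) * (
    math.prod(frobenius(W) + frobenius(D) for W, D in zip(weights, deltas))
    - math.prod(frobenius(W) for W in weights)
)
print(f"actual change in output: {change:.4f}")
print(f"upper bound:             {bound:.4f}")
```

The bound depends only on the layer-wise norms of the weights and their perturbations, which is the sense in which it uses explicit architectural information rather than the Hessian.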

If anyone has any feedback or suggestions, please let me know!