The Hessian rank bounds the learning coefficient

TL;DR: In a neural network with $d$ parameters, the (local) learning coefficient $\lambda$ can be upper and lower bounded by the rank of the network’s Hessian $H$:

$$\frac{\mathrm{rank}(H)}{2} \le \lambda \le \frac{\mathrm{rank}(H)}{2} + \frac{d - \mathrm{rank}(H)}{4}.$$

The lower bound is a known result. The upper bound is a claim by me, and this post contains the proof for it.[1] If you find any problems, do point them out.

Edit 16.08.2024: The original version of this post had a three in the denominator of the upper bound. Dmitry Vaintrob spotted an improvement to make it a four.

Introduction

The learning coefficient $\lambda$ is a measure of loss basin volume and model complexity. You can think of it sort of like an effective parameter count of the neural network. Simpler models that do less stuff have smaller $\lambda$.

Calculating $\lambda$ for real networks people actually use is a pain. My hope is that these bounds help make estimating it a bit easier.

In a network with $d$ parameters, the learning coefficient is always a number

$$0 \le \lambda \le \frac{d}{2}.$$

An existing result in the literature says that if you’ve calculated the rank of the network’s Hessian $H$,[2] you get a tighter lower bound

$$\frac{\mathrm{rank}(H)}{2} \le \lambda.$$

I claim that we can also get a tighter upper bound

$$\lambda \le \frac{\mathrm{rank}(H)}{2} + \frac{d - \mathrm{rank}(H)}{4},$$

where $d - \mathrm{rank}(H)$ will be the dimension of the Hessian kernel, meaning the number of zero eigenvalues it has.[3]

This is neat because it means we can get some idea of how large $\lambda$ is using only standard linear algebra methods. All we need to know is how many zero eigenvalues the Hessian has.[4] Singular Learning Theory introductions often stress that just looking at the Hessian isn’t enough to measure basin volume correctly. But here we see that if you do it right, the Hessian eigenspectrum can give you a pretty good idea of how large $\lambda$ is. Especially if there aren’t that many zero eigenvalues.
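
To make that concrete, here’s a minimal numpy sketch of the recipe on a toy loss. The toy loss, the eigenvalue threshold, and the function name are purely illustrative choices, nothing canonical:

```python
import numpy as np

# Toy loss with d = 4 parameters, standing in for a network's loss landscape:
#   K(w) = w0^2 + (w1 + w2)^2 + w3^4
# At the minimum w = 0 the Hessian has one flat direction (w1 - w2) and one
# direction (w3) whose leading term is quartic, so rank(H) = 2.
def toy_hessian_at_zero():
    H = np.zeros((4, 4))
    H[0, 0] = 2.0                      # from w0^2
    H[1:3, 1:3] = [[2.0, 2.0],
                   [2.0, 2.0]]         # from (w1 + w2)^2, a rank-1 block
    # the w3^4 term contributes nothing to the Hessian at w = 0
    return H

d = 4
eigvals = np.linalg.eigvalsh(toy_hessian_at_zero())
rank = int(np.sum(np.abs(eigvals) > 1e-8))  # the threshold is a judgment call in practice

lower = rank / 2
upper = rank / 2 + (d - rank) / 4
print("Hessian eigenvalues:", eigvals)
print(f"rank(H) = {rank}, so {lower} <= lambda <= {upper}")
# For this toy K the true learning coefficient is 1/2 + 1/2 + 0 + 1/4 = 1.25,
# which sits inside the printed interval [1.0, 1.5].
```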

Intuitively, the lower bound works because a direction in the parameters that isn’t free to vary to second order in the Taylor expansion won’t become any more free to vary if you pile on a bunch of higher-order terms. The second-order term strictly dominates the higher-order ones; they can’t cancel it out.

Qualitatively speaking, the upper bound works for the same reason. The higher-order terms in the Taylor expansion of the loss can only matter so much. The Hessian is the leading term, so it can impact $\lambda$ the most, adding $\frac{1}{2}$ per Hessian rank to it. The remaining terms, which are of order four and above in the kernel directions, can only add a maximum of $\frac{1}{4}$ each for the remaining $d - \mathrm{rank}(H)$ directions.
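
As a quick worked illustration of this counting, take a toy loss with one quadratic and one quartic direction (an example chosen purely for illustration):

```latex
% Toy loss: K(w_1, w_2) = w_1^2 + w_2^4, so d = 2 and the Hessian at the
% origin is diag(2, 0), giving rank(H) = 1.
% The quadratic direction contributes 1/2 to the learning coefficient and
% the quartic direction contributes 1/4:
\[
  \lambda = \tfrac{1}{2} + \tfrac{1}{4} = \tfrac{3}{4},
  \qquad
  \frac{\mathrm{rank}(H)}{2} = \tfrac{1}{2}
  \;\le\; \lambda \;\le\;
  \frac{\mathrm{rank}(H)}{2} + \frac{d - \mathrm{rank}(H)}{4} = \tfrac{3}{4}.
\]
% Here the upper bound happens to be exactly tight.
```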

The proof for the upper bound will just be a small modification of the proof of Theorem 7.2 on pages 220 and 221 of Algebraic Geometry and Statistical Learning Theory. Maybe read that first if you want more technical context.

Some words on notation

In the following, I’ll mostly stick to the notation and conventions of the book Algebraic Geometry and Statistical Learning Theory. You can read about all the definitions there. I’m too lazy to reproduce them all.

To give some very rough context, $K(w)$ is sort of like the ‘loss’ at parameter configuration $w$, $\varphi(w)$ is our prior over parameters, and $Z_n$ is the partition function after updating on $n$ data points.[5]
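
For a very compressed sketch of how these objects fit together, ignoring sampling noise and all regularity conditions (again, see the book for the real definitions):

```latex
% Compressed setup: K(w) >= 0 plays the role of the loss at parameter w,
% varphi(w) is the prior, and the partition function is
\[
  Z_n \;=\; \int_W \exp\bigl(-n\,K(w)\bigr)\,\varphi(w)\,dw .
\]
% The learning coefficient lambda and its multiplicity m are the numbers
% governing the asymptotics of this integral:
\[
  -\log Z_n \;=\; \lambda \log n \;-\; (m - 1)\log\log n \;+\; O(1)
  \qquad \text{as } n \to \infty .
\]
```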

Theorem:

Let $W \subseteq \mathbb{R}^d$ be the set of parameters of the model. If there exists an open set $U \subset W$ such that

$$W_0 = \{ w \in U : K(w) = 0, \ \varphi(w) > 0 \}$$

is not an empty set, and we define $\mathrm{rank}(H)$ as the rank of the Hessian

$$H_{ij} = \frac{\partial^2 K}{\partial w_i \partial w_j}\bigg|_{w_0}$$

at a $w_0 \in W_0$, with $w_i$ the coordinates of $w$ along the elements $e_i$ of some orthonormal basis of $\mathbb{R}^d$, then

$$\lambda \le \frac{\mathrm{rank}(H)}{2} + \frac{d - \mathrm{rank}(H)}{4}.$$

Proof:

We can assume $w_0 = 0$ without loss of generality. If $\epsilon, \delta > 0$ are sufficiently small constants,

$$Z_n \ge \int_{|w| \le \epsilon} \exp\left(-n K(w)\right) \varphi(w) \, dw \ge \delta \int_{|w| \le \epsilon} \exp\left(-n K(w)\right) dw.$$

Here, $\delta \le \min_{|w| \le \epsilon} \varphi(w)$, which can be chosen positive because $\varphi(w_0) > 0$ and $\varphi$ is continuous.

If we pick $\{e_i\}$ to be the Hessian eigenbasis, ordered so that the first $\mathrm{rank}(H)$ eigenvalues are the nonzero ones, then for sufficiently small $\epsilon$ and some constant $C > 0$

$$K(w) \le C \left( \sum_{i \le \mathrm{rank}(H)} w_i^2 + \sum_{i > \mathrm{rank}(H)} w_i^4 \right).$$

There is no $w_i^3$ term because we are at a local optimum, so no direction can have a leading term that is an odd power.

For sufficiently large $n$, transforming $w_i \to w_i / n^{1/2}$ for $i \le \mathrm{rank}(H)$ and $w_i \to w_i / n^{1/4}$ for $i > \mathrm{rank}(H)$ yields

$$\int_{|w| \le \epsilon} \exp\left(-n K(w)\right) dw \ge n^{-\frac{\mathrm{rank}(H)}{2} - \frac{d - \mathrm{rank}(H)}{4}} \int_{D_n} \exp\left( -C \left( \sum_{i \le \mathrm{rank}(H)} w_i^2 + \sum_{i > \mathrm{rank}(H)} w_i^4 \right) \right) dw,$$

where $D_n$ is the image of the ball $|w| \le \epsilon$ under the rescaling.

Rearranging and substituting this into the bound for $Z_n$ gives

$$Z_n \, n^{\frac{\mathrm{rank}(H)}{2} + \frac{d - \mathrm{rank}(H)}{4}} \ge \delta \int_{D_n} \exp\left( -C \left( \sum_{i \le \mathrm{rank}(H)} w_i^2 + \sum_{i > \mathrm{rank}(H)} w_i^4 \right) \right) dw.$$

The right-hand side will converge to a positive constant as $n \to \infty$:

$$\delta \int_{D_n} \exp\left( -C \left( \sum_{i \le \mathrm{rank}(H)} w_i^2 + \sum_{i > \mathrm{rank}(H)} w_i^4 \right) \right) dw \to \delta \int_{\mathbb{R}^d} \exp\left( -C \left( \sum_{i \le \mathrm{rank}(H)} w_i^2 + \sum_{i > \mathrm{rank}(H)} w_i^4 \right) \right) dw > 0.$$

Thus, we have

$$\liminf_{n \to \infty} Z_n \, n^{\frac{\mathrm{rank}(H)}{2} + \frac{d - \mathrm{rank}(H)}{4}} > 0. \quad (1)$$

By Theorem 7.1(1), page 218 of Algebraic Geometry and Statistical Learning Theory,[6] we also know

$$Z_n \cong C' \frac{(\log n)^{m-1}}{n^{\lambda}}, \quad (2)$$

where $C'$ is some positive constant and $m$ is the order of the largest pole $z = -\lambda$ of

$$\zeta(z) = \int K(w)^{z} \varphi(w) \, dw.$$

Putting (1) and (2) together, we get

$$\lambda \le \frac{\mathrm{rank}(H)}{2} + \frac{d - \mathrm{rank}(H)}{4},$$

since for any larger $\lambda$, $Z_n \, n^{\frac{\mathrm{rank}(H)}{2} + \frac{d - \mathrm{rank}(H)}{4}}$ would converge to zero by (2), contradicting (1).
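
As a quick sanity check of the statement on a standard toy example (not part of the proof):

```latex
% Toy case: K(w_1, w_2) = w_1^2 w_2^2 with a uniform prior on [0, 1]^2, d = 2.
% Every second derivative of K vanishes at the origin, so rank(H) = 0.
\[
  \zeta(z) \;=\; \int_0^1 \!\! \int_0^1 (w_1 w_2)^{2z} \, dw_1 \, dw_2
          \;=\; \frac{1}{(2z + 1)^2} ,
\]
% so the largest pole is z = -1/2 with order m = 2, i.e. lambda = 1/2.
% The bounds give
\[
  \frac{\mathrm{rank}(H)}{2} = 0
  \;\le\; \lambda = \tfrac{1}{2} \;\le\;
  \frac{\mathrm{rank}(H)}{2} + \frac{d - \mathrm{rank}(H)}{4} = \tfrac{1}{2} ,
\]
% with the upper bound again tight in this maximally degenerate case.
```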

Future work

I think it is possible to prove progressively tighter upper and lower bounds for the learning coefficient using information from progressively higher-order terms in the Taylor expansion of the loss.

For example, if the third and fourth-order terms involving the kernel directions happened to all be zero as well for a particular model, we could’ve transformed those directions by $w_i \to w_i / n^{1/6}$ in the proof instead, and obtained the even tighter upper bound $\lambda \le \frac{\mathrm{rank}(H)}{2} + \frac{d - \mathrm{rank}(H)}{6}$.
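
To spell out the scaling heuristic behind this, here is the one-dimensional version of the rescaling step from the proof, for a kernel direction whose leading term has some even order $2k$ (the exponent $2k$ and the constant $C$ are generic placeholders):

```latex
% If the leading term of K along a kernel direction is of order 2k, the
% substitution w -> w / n^(1/(2k)) pulls a factor n^(-1/(2k)) out of the
% integral while leaving the exponent n-independent:
\[
  \int_{|w| \le \epsilon} \exp\!\bigl(-n\,C\,w^{2k}\bigr)\,dw
  \;=\; n^{-\frac{1}{2k}}
        \int_{|u| \le \epsilon\, n^{1/(2k)}} \exp\!\bigl(-C\,u^{2k}\bigr)\,du ,
\]
% so that direction contributes at most 1/(2k) to the upper bound on lambda.
% Taking k = 1 for the rank(H) quadratic directions and k = 2 for the rest
% recovers rank(H)/2 + (d - rank(H))/4; if the quartic terms also vanished,
% k = 3 would give the (d - rank(H))/6 term above.
```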

I have some early results for an upper bound formula like this, though right now it only works under specific conditions.[7]

Since the higher-order tensors in these terms can be very computationally intensive to deal with, the real challenge in making these progressively stricter bounds practically useful would be finding tricks to make handling those tensors cheaper. A SPAR team I’m mentoring is currently trying out some ideas on that front on real networks.

Acknowledgments

Credit to Dmitry Vaintrob for spotting the improvement that tightened the upper bound from $\frac{\mathrm{rank}(H)}{2} + \frac{d - \mathrm{rank}(H)}{3}$ in the original version of this post to $\frac{\mathrm{rank}(H)}{2} + \frac{d - \mathrm{rank}(H)}{4}$. Thanks to Daniel Murfet for discussion and feedback on the proof idea.

  1. ^

    I couldn’t find the upper bound in the literature and Daniel Murfet hadn’t heard of it either, which is why I wrote this. But it could be around somewhere and I just missed it.

  2. ^

    For the learning coefficient, this will be the smallest Hessian rank out of all points in the set of minimum loss. For the local learning coefficient, it’ll be the smallest Hessian rank out of all minimum loss points in the local neighbourhood we specify.

  3. ^

    This will hold for the Hessian rank at any point with minimum loss, though the upper bound from the point with the smallest Hessian rank will of course be the tightest.

  4. ^

    Or, in practice, how many Hessian eigenvalues are numerically small enough to not matter at the noise scale we choose.

  5. ^

    I.e. the integral of our unnormalised Bayesian posterior over the parameters after updating on $n$ data points.

  6. ^

    Careful, at least my version of the book has a typo here. I think the fraction should be the other way around from the way it’s written in the book.

  7. ^

    Only for MSE loss, and only if we are at a global optimum where we match the output labels exactly.