The Hessian rank bounds the learning coefficient
TL;DR: In a neural network with $d$ parameters, the (local) learning coefficient $\lambda$ can be upper and lower bounded using the rank $d_1$ of the network's Hessian:
$$\frac{d_1}{2} \le \lambda \le \frac{d_1}{2} + \frac{d - d_1}{4}.$$
The lower bound is a known result. The upper bound is my own claim, and this post contains the proof for it.[1] If you find any problems, do point them out.
Edit 16.08.2024: The original version of this post had a three in the denominator of the upper bound. Dmitry Vaintrob spotted an improvement to make it a four.
Introduction
The learning coefficient λ is a measure of loss basin volume and model complexity. You can think of it sort of like an effective parameter count of the neural network. Simpler models that do less stuff have smaller λ.
Calculating λ for real networks people actually use is a pain. My hope is that these bounds help make estimating it a bit easier.
In a network with $d$ parameters, the learning coefficient is always a number
$$0 \le \lambda \le \frac{d}{2}.$$
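For orientation, here's a quick sketch (a standard singular learning theory fact, nothing new in this post) of why the upper end $\frac{d}{2}$ is attained in the regular case, where the Hessian at the minimum has full rank:

```latex
% Regular case: near the minimum, K(w) behaves like a full-rank quadratic form,
% say K(w) = \sum_{i=1}^{d} w_i^2. Each direction contributes a factor n^{-1/2}:
Z(n) = \int e^{-n \sum_{i=1}^{d} w_i^2}\, dw \;\propto\; n^{-d/2}
\quad\Longrightarrow\quad \lambda = \frac{d}{2}.
% Singular models have directions the quadratic term misses, so typically \lambda < d/2.
```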
An existing result in the literature says that if you've calculated the rank $d_1$ of the network's Hessian,[2] you get a tighter lower bound
$$\frac{d_1}{2} \le \lambda.$$
I claim that we can also get a tighter upper bound
$$\lambda \le \frac{d_1}{2} + \frac{d - d_1}{4},$$
where $d - d_1$ is the dimension of the Hessian kernel, i.e. the number of zero eigenvalues it has.[3]
This is neat because it means we can get some idea of how large $\lambda$ is using only standard linear algebra methods. All we need to know is how many zero eigenvalues the Hessian has.[4] Singular Learning Theory introductions often stress that just looking at the Hessian isn't enough to measure basin volume correctly. But here we see that if you do it right, the Hessian eigenspectrum can give you a pretty good idea of how large $\lambda$ is, especially if there aren't that many zero eigenvalues.
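To make this concrete, here is a minimal sketch (my own illustration, with a hypothetical tolerance for what counts as a "numerically zero" eigenvalue, cf. footnote [4]) of computing both bounds from a Hessian:

```python
import numpy as np

def lambda_bounds(hessian, tol=1e-8):
    """Lower/upper bounds on the learning coefficient from the Hessian
    at a minimum-loss point. tol is a hypothetical cutoff: eigenvalues
    with magnitude below it are treated as exact zeros (footnote [4])."""
    eigvals = np.linalg.eigvalsh(hessian)    # Hessian is symmetric
    d = len(eigvals)
    d1 = int(np.sum(np.abs(eigvals) > tol))  # numerical rank
    return d1 / 2, d1 / 2 + (d - d1) / 4

# Toy case: K(w) = w1^2 + w2^4 has Hessian diag(2, 0) at the origin.
lo, hi = lambda_bounds(np.diag([2.0, 0.0]))
print(lo, hi)  # 0.5 0.75
```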
Intuitively, the lower bound works because a direction in the parameters $w$ that isn't free to vary to second order in the Taylor expansion won't become any more free to vary if you pile on a bunch of higher-order terms. The second-order term strictly dominates the higher-order ones; they can't cancel it out.
Qualitatively speaking, the upper bound works for the same reason. The higher-order terms in the Taylor expansion of the loss can only matter so much. The Hessian is the leading term, so it can impact $\lambda$ the most, adding $\frac{1}{2}$ per Hessian rank direction. The remaining terms of order $O(|w|^3)$ and above can only add a maximum of $\frac{1}{4}$ for each of the remaining directions.
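To see that the $\frac{1}{4}$ per kernel direction can actually be attained, here is a toy example (my own, not from the book), where a quartic direction contributes exactly $\frac{1}{4}$:

```latex
% Toy loss K(w) = w_1^2 + w_2^4 on R^2: d = 2, Hessian at 0 is diag(2, 0), so d_1 = 1.
Z(n) = \int e^{-n(w_1^2 + w_2^4)}\,dw
     = \underbrace{\int e^{-n w_1^2}\,dw_1}_{\propto\, n^{-1/2}}
       \cdot
       \underbrace{\int e^{-n w_2^4}\,dw_2}_{\propto\, n^{-1/4}}
     \;\propto\; n^{-3/4},
% so \lambda = 3/4 = d_1/2 + (d - d_1)/4: the upper bound is tight here.
```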
The proof of the upper bound will just be a small modification of the proof of Theorem 7.2 on pages 220 and 221 of Algebraic Geometry and Statistical Learning Theory. Maybe read that first if you want more technical context.
Some words on notation
In the following, I’ll mostly stick to the notation and conventions of the book Algebraic Geometry and Statistical Learning Theory. You can read about all the definitions there. I’m too lazy to reproduce them all.
To give some very rough context, $K(w)$ is sort of like the 'loss' at parameter configuration $w$, $\varphi(w)$ is our prior over parameters, and $Z(n)$ is the partition function after updating on $n$ data points.[5]
Theorem:
Let $W \subset \mathbb{R}^d$ be the set of parameters of the model. If there exists an open set $U \subset W$ such that
$$\{w \in U : K(w) = 0,\ \varphi(w) > 0\}$$
is not empty, and we define $d_1 = \operatorname{rank}(H)$ as the rank of the Hessian $H$ at a point $w_0 \in U$,
$$H_{i,j} = \left.\frac{\partial^2 K(w)}{\partial w_i\, \partial w_j}\right|_{w=w_0},$$
with $w_i, w_j$ elements of some orthonormal basis $\{w_1, \dots, w_d\}$ of $\mathbb{R}^d$, then
$$\lambda \le \frac{d_1}{2} + \frac{d - d_1}{4}.$$
Proof:
We can assume $w_0 = 0$ without loss of generality. If $\epsilon_1, \epsilon_2$ are sufficiently small constants,
$$Z(n) = \int \exp(-nK(w))\,\varphi(w)\,dw \ge \int_{|w^{(1)}|\le\epsilon_1,\;|w^{(2)}|\le\epsilon_2} \exp\{-nK(w)\}\,\varphi(w)\,dw.$$
Here, $w^{(1)} \in W/\ker(H)$ and $w^{(2)} \in \ker(H)$.
If we pick $\{w_1, \dots, w_d\}$ to be the Hessian eigenbasis, then for sufficiently small $|w| > 0$
$$K(w) = \frac{1}{2}\sum_{i=1}^{d_1} H_{i,i}\,\big(w^{(1)}_i\big)^2 + O(|w^{(1)}|^2|w^{(2)}|) + O(|w^{(2)}|^2|w^{(1)}|) + O(|w|^4).$$
There is no $O(|w^{(2)}|^3)$ term because we are at a local optimum, so no direction can have a leading term that is an odd power.
For sufficiently large $n$, transforming $Z(n)$ by $w'^{(1)} = n^{\frac{1}{2}} w^{(1)}$, $w'^{(2)} = n^{\frac{1}{4}} w^{(2)}$ yields
$$Z(n) \ge n^{-\frac{d_1}{2}}\, n^{-\frac{d-d_1}{4}} \int_{|w'^{(1)}|\le 1,\;|w'^{(2)}|\le 1} \exp\big\{-nK\big(n^{-\frac{1}{2}}w'^{(1)}, n^{-\frac{1}{4}}w'^{(2)}\big)\big\}\,\varphi\big(n^{-\frac{1}{2}}w'^{(1)}, n^{-\frac{1}{4}}w'^{(2)}\big)\,dw'.$$
Rearranging and substituting gives
$$Z(n)\, n^{\frac{d_1}{2} + \frac{d-d_1}{4}} \ge \int_{|w'|\le 1} \exp\Big\{-\frac{1}{2}\sum_{i=1}^{d_1} H_{i,i}\,\big(w'^{(1)}_i\big)^2 + n^{-\frac{1}{4}}\,O(|w'^{(1)}|^2|w'^{(2)}|) + O(|w'^{(2)}|^2|w'^{(1)}|) + O(|w'|^4)\Big\}\,\varphi\big(n^{-\frac{1}{2}}w'^{(1)}, n^{-\frac{1}{4}}w'^{(2)}\big)\,dw'.$$
The right-hand side will converge to a positive constant $c > 0$ as $n \to \infty$:
$$\int_{|w'|\le 1} \exp\Big\{-\frac{1}{2}\sum_{i=1}^{d_1} H_{i,i}\,\big(w'^{(1)}_i\big)^2 + O(|w'^{(2)}|^2|w'^{(1)}|) + O(|w'^{(2)}|^4)\Big\}\,\varphi(0)\,dw' = c\,.$$
Thus, we have
$$\lim_{n\to\infty} Z(n)\, n^{\frac{d_1}{2}+\frac{d-d_1}{4}} \ge c\,. \quad (1)$$
By Theorem 7.1(1), page 218 of Algebraic Geometry and Statistical Learning Theory,[6] we also know
$$\lim_{n\to\infty} \frac{Z(n)\, n^{\lambda}}{(\log n)^{m-1}} = c^*\,, \quad (2)$$
where $c^* > 0$ is some positive constant and $m$ is the order of the largest pole of
$$\zeta(z) = \int K(w)^z\, \varphi(w)\, dw.$$
Putting (1) and (2) together: by (2), $Z(n)$ decays like $c^*\, n^{-\lambda} (\log n)^{m-1}$, so if $\lambda$ were larger than $\frac{d_1}{2}+\frac{d-d_1}{4}$, the limit in (1) would be zero rather than bounded below by $c > 0$, since the polylogarithmic factor cannot make up a polynomial gap. Hence
$$\lambda \le \frac{d_1}{2} + \frac{d - d_1}{4}. \qquad \square$$
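The scaling argument in the proof can also be checked numerically. Here is a sketch (my own toy check, with a hypothetical loss $K(w) = w_1^2 + w_2^4$ on a flat prior, where $d = 2$, $d_1 = 1$) that estimates $\lambda$ from the log-log slope of $Z(n)$:

```python
import numpy as np

def Z(n, grid=np.linspace(-1.0, 1.0, 40001)):
    """Partition function for the separable toy loss K(w) = w1^2 + w2^4
    with a flat prior on [-1, 1]^2, by simple grid quadrature."""
    dw = grid[1] - grid[0]
    z1 = np.sum(np.exp(-n * grid**2)) * dw  # rank-1 Hessian direction
    z2 = np.sum(np.exp(-n * grid**4)) * dw  # kernel (quartic) direction
    return z1 * z2

# Z(n) ~ n^{-lambda}, so -lambda is the slope of log Z against log n.
ns = np.geomspace(200, 3200, 5)
slope, _ = np.polyfit(np.log(ns), np.log([Z(n) for n in ns]), 1)
print(-slope)  # close to 0.75 = d1/2 + (d - d1)/4
```

Here the upper bound is attained, so the estimated $\lambda$ should sit right at $0.75$.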
Future work
I think it is possible to prove progressively tighter upper and lower bounds for the learning coefficient λ using information from progressively higher-order terms in the Taylor expansion of the loss.
For example, if the third- and fourth-order terms involving $w^{(2)}$ happened to all be zero as well for a particular model, we could've transformed $w'^{(2)} = n^{\frac{1}{6}} w^{(2)}$ in the proof instead, and obtained the even tighter upper bound $\lambda \le \frac{d_1}{2} + \frac{d - d_1}{6}$.
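The same kind of numerical check as above illustrates this (again my own toy example): if the kernel direction only enters at sixth order, say $K(w) = w_1^2 + w_2^6$, the measured $\lambda$ lands on the tighter $\frac{d_1}{2} + \frac{d-d_1}{6}$:

```python
import numpy as np

def Z(n, grid=np.linspace(-1.0, 1.0, 40001)):
    """Partition function for K(w) = w1^2 + w2^6, flat prior on [-1, 1]^2."""
    dw = grid[1] - grid[0]
    return (np.sum(np.exp(-n * grid**2)) * dw) * (np.sum(np.exp(-n * grid**6)) * dw)

ns = np.geomspace(200, 3200, 5)
slope, _ = np.polyfit(np.log(ns), np.log([Z(n) for n in ns]), 1)
print(-slope)  # close to 1/2 + 1/6 ≈ 0.667
```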
I have some early results for an upper bound formula like this, though right now it only works under specific conditions.[7]
Since the higher-order tensors in these $O(|w|^n)$ terms can be very computationally intensive to deal with, the real challenge in making these progressively stricter bounds for $\lambda$ practically useful would be finding tricks to make handling those tensors cheaper. A SPAR team I'm mentoring is currently trying out some ideas on that front on real networks.
Acknowledgments
Credit to Dmitry Vaintrob for spotting the improvement that tightened the upper bound from $\frac{d_1}{2} + \frac{d-d_1}{3}$ in the original version of this post to $\frac{d_1}{2} + \frac{d-d_1}{4}$. Thanks to Daniel Murfet for discussion and feedback on the proof idea.
Footnotes
[1] I couldn’t find the upper bound in the literature and Daniel Murfet hadn’t heard of it either, which is why I wrote this. But it could be around somewhere and I just missed it.
[2] For the learning coefficient, this will be the smallest Hessian rank out of all points in the set of minimum loss. For the local learning coefficient, it’ll be the smallest Hessian rank out of all minimum-loss points in the local neighbourhood we specify.
[3] This will hold for the Hessian rank at any point with minimum loss, though the upper bound from the point with the smallest Hessian rank will of course be the tightest.
[4] Or, in practice, how many Hessian eigenvalues are numerically small enough to not matter at the noise scale we choose.
[5] I.e. the integral of our Bayesian posterior over the parameters after updating on $n$ data points.
[6] Careful, at least my version of the book has a typo here. I think the fraction should be the other way around from the way it’s written in the book.
[7] Only for MSE loss, and only if we are at a global optimum where we match the output labels exactly.