Hessian and Basin volume
Thanks to Thomas Kwa for the question which prompted this post.
Note: This is mostly a primer / introductory reference, not a research post. However, the details should be interesting even to those familiar with the area.
When discussing “broad basins” in the loss landscape of a DNN, the Hessian of loss is often referenced. This post explains a simple theoretical approximation of basin volume which uses the Hessian of loss.[1]
Suppose our minimum has loss=0. Define the basin as the region of parameter space draining to our minimum where loss < threshold T.[2]
Simplest model
If all eigenvalues of the Hessian are positive and non-trivial,[3] we can approximate the loss as a paraboloid centered on our minimum:
The vertical axis is loss, and the horizontal plane is parameter space. The shape of the basin in parameter space is the shadow of this paraboloid, which is an ellipsoid.
The principal directions of curvature of the paraboloid are given by the eigenvectors of the Hessian. The curvature (second derivative) in each of those directions is given by the corresponding eigenvalue.
Radii of the ellipsoid: If we start at our minimum and walk away in a principal direction, the loss as a function of distance traveled is $L(x) = \frac{1}{2}\lambda_i x^2$, where $\lambda_i$ is the Hessian eigenvalue for that direction. So given our loss threshold $T$, we will hit that threshold at a distance of $x = \sqrt{2T/\lambda_i}$. This is the radius of the loss-basin ellipsoid in that direction.
The volume of the ellipsoid is $V_{\text{basin}} = V_n \prod_i \sqrt{2T/\lambda_i}$, where the constant $V_n$ is the volume of the unit $n$-ball in $n$ dimensions. Since the product of the eigenvalues is the determinant of the Hessian, we can write this as:
$$V_{\text{basin}} = \frac{V_n (2T)^{n/2}}{\sqrt{\det[\text{Hessian}]}}$$

So the basin volume is inversely proportional to the square root of the determinant of the Hessian. Everything in the numerator is a constant, so only the determinant of the Hessian matters in this model.
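As a concrete sketch, here is that formula in code for a tiny toy Hessian (the function name and example values are mine, not from the post). It also shows what happens once an eigenvalue hits zero:

```python
import numpy as np
from scipy.special import gamma

def basin_volume(hessian, T):
    """Paraboloid-model volume: V_n * (2T)^(n/2) / sqrt(det[Hessian]).

    Assumes all eigenvalues of `hessian` are strictly positive.
    """
    n = hessian.shape[0]
    unit_ball = np.pi ** (n / 2) / gamma(n / 2 + 1)  # volume V_n of the unit n-ball
    eigvals = np.linalg.eigvalsh(hessian)
    return unit_ball * (2 * T) ** (n / 2) / np.sqrt(np.prod(eigvals))

# 2D example: H = diag(2, 8) with T = 1 gives radii sqrt(2T/lambda_i) = 1 and 0.5,
# so the ellipse "volume" (area) should be pi * 1 * 0.5.
H = np.diag([2.0, 8.0])
print(basin_volume(H, T=1.0))  # ~1.5708, i.e. pi/2

# One zero eigenvalue makes det[Hessian] = 0 and the volume blows up:
H_flat = np.diag([2.0, 0.0])
print(basin_volume(H_flat, T=1.0))  # inf
```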
The problem with this model is that the determinant of the Hessian is usually zero, due to zero eigenvalues.
Fixing the model
If we don’t include a regularization term in the loss, the basin as we defined it earlier can actually be infinitely big (it’s not just a problem with the paraboloid model). However, we don’t really care about volume that is so far from the origin that it is never reached.
A somewhat principled way to fix the model is to look at volume weighted by the initialization distribution. This is easiest to work with if the initialization is Gaussian. To make the math tractable, we can replace our ellipsoid with a “fuzzy ellipsoid”—i.e. a multivariate Gaussian. Now we just have to integrate the product of two Gaussians, which should be easy. There are also somewhat principled reasons for using a “fuzzy ellipsoid”, which I won’t explain here.
However, this is only somewhat principled; if you think about it further, it starts to become unclear: Should we use the initialization Gaussian, or one based on the expected final L2 norm? What about cases where the norm peaks in the middle of training, and is smaller at the start and finish?
If we have an L2 regularization term in the loss, then the infinite volume problem usually goes away; the L2 term makes all the eigenvalues positive, so the formula is fine. If we have weight decay, we can interpret this as L2 regularization and add it to the loss.
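A quick numeric illustration of that eigenvalue shift (toy values of my choosing): the Hessian of the penalty $\frac{1}{2}\lambda\|\theta\|^2$ is exactly $\lambda I_n$, so it lifts every eigenvalue by $\lambda$:

```python
import numpy as np

# The Hessian of (lam/2) * ||theta||^2 is lam * I, independent of theta,
# so L2 regularization shifts every eigenvalue of the loss Hessian up by lam.
lam = 0.1
H_loss = np.diag([3.0, 1.0, 0.0])   # toy unregularized Hessian; one flat direction
H_reg = H_loss + lam * np.eye(3)    # Hessian of (loss + (lam/2)||theta||^2)

print(np.linalg.eigvalsh(H_reg))    # [0.1, 1.1, 3.1] -- all strictly positive
```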
For a relatively simple approximation, I recommend the formula:
$$V_{\text{basin}} = \frac{V_n (2T)^{n/2}}{\sqrt{\det[\text{Hessian}(\text{Loss}) + (\lambda + c) I_n]}}$$

Where:
Loss is the unregularized loss
$\lambda$ is the amount of weight decay (or L2 regularization $\frac{1}{2}\lambda\|\theta\|^2$)
$c = k/\sigma^2$, where $\sigma$ is the standard deviation of the initialization Gaussian, and $k$ is a constant on the order of unity. I have not calculated the theoretically most appropriate value of $k$. For a crude model, $k=1$ is probably good enough.
$T$ is the loss threshold. If you really care about the absolute volume, you can try to set $T$ empirically by looking at where the paraboloid approximation breaks down. If you only care about volume relative to other basins, you can ignore $T$, since it is a constant.
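Putting the recommended formula into code (a sketch with made-up toy Hessians, working in log-space since the determinant under- or overflows for realistic $n$):

```python
import numpy as np
from scipy.special import gammaln

def log_basin_volume(hessian, T, weight_decay, sigma, k=1.0):
    """log of V_n (2T)^(n/2) / sqrt(det[Hessian + (lambda + c) I_n]), c = k/sigma^2.

    Computed in log-space, since the determinant over/underflows for large n.
    Toy sketch of the formula above, with k = 1 as suggested in the text.
    """
    n = hessian.shape[0]
    c = k / sigma ** 2
    shifted_eigs = np.linalg.eigvalsh(hessian) + weight_decay + c
    log_unit_ball = (n / 2) * np.log(np.pi) - gammaln(n / 2 + 1)  # log V_n
    return log_unit_ball + (n / 2) * np.log(2 * T) - 0.5 * np.sum(np.log(shifted_eigs))

# Compare two toy minima of a 100-parameter model; the flatter one wins.
rng = np.random.default_rng(0)
H_sharp = np.diag(rng.uniform(1.0, 10.0, size=100))
H_flat = 0.1 * H_sharp
v_sharp = log_basin_volume(H_sharp, T=0.01, weight_decay=1e-4, sigma=1.0)
v_flat = log_basin_volume(H_flat, T=0.01, weight_decay=1e-4, sigma=1.0)
print(v_flat > v_sharp)  # True
```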
Estimation in practice
If the DNN of interest is large (>10k parameters, for instance), the Hessian becomes very unwieldy.[4] Luckily, it is possible to efficiently estimate the quantity $\det[\text{Hessian}(\text{Loss}) + (\lambda + c) I_n]$ without ever computing the Hessian.
One correct[5] method of doing this is to get the eigenvalue spectrum of the Hessian using stochastic Lanczos quadrature.[6] Then shift the spectrum up by $\lambda + c$ and estimate the product.
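For concreteness, here is a bare-bones sketch of stochastic Lanczos quadrature for the log-determinant, using only matrix-vector products, with the $\lambda + c$ shift folded into the matvec. This is my own illustrative implementation, not the post's, and omits the safeguards a real one needs:

```python
import numpy as np

def slq_logdet(matvec, n, shift, num_probes=30, lanczos_steps=30, seed=0):
    """Estimate log det(A + shift*I) from matrix-vector products only.

    Bare-bones stochastic Lanczos quadrature: for each Rademacher probe v,
    run Lanczos to build a small tridiagonal T; its Ritz values theta_k and
    weights tau_k give v^T log(A + shift*I) v ~= n * sum_k tau_k * log(theta_k).
    Averaging over probes estimates tr(log(A + shift*I)), i.e. the log-det.
    (Sketch only: assumes A + shift*I is positive definite.)
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=n)
        q = v / np.linalg.norm(v)
        Q, alphas, betas = [q], [], []
        beta, q_prev = 0.0, np.zeros(n)
        for j in range(lanczos_steps):
            w = matvec(q) + shift * q - beta * q_prev
            alpha = q @ w
            w = w - alpha * q
            for qi in Q:                      # full reorthogonalization
                w = w - (qi @ w) * qi
            alphas.append(alpha)
            beta = np.linalg.norm(w)
            if beta < 1e-12:
                break
            q_prev, q = q, w / beta
            Q.append(q)
            if j < lanczos_steps - 1:
                betas.append(beta)
        m = len(alphas)
        T = np.diag(alphas) + np.diag(betas[:m - 1], 1) + np.diag(betas[:m - 1], -1)
        theta, U = np.linalg.eigh(T)
        tau = U[0, :] ** 2                    # quadrature weights
        estimates.append(n * np.sum(tau * np.log(theta)))
    return float(np.mean(estimates))

# Sanity check against the exact answer on a small explicit SPD matrix:
rng = np.random.default_rng(1)
M = rng.normal(size=(60, 60))
A = M @ M.T / 60 + np.eye(60)
est = slq_logdet(lambda x: A @ x, n=60, shift=0.5)
exact = np.linalg.slogdet(A + 0.5 * np.eye(60))[1]
print(est, exact)  # should agree to within a few percent
```

For a real network, `matvec` would be a Hessian-vector product (e.g. via double backprop), so the full Hessian is never formed.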
Roasting the literature a bit
The “easy way out” is to use the trace of the Hessian instead of the determinant. This is extremely easy to estimate: Just sample the second derivative in random directions, and the average value is proportional to the trace. The problem is that the trace is simply the wrong measure, and is probably a somewhat poor proxy for the determinant.
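The trace estimate described above, in a few lines (toy explicit Hessian of my choosing; in practice the directional curvature would come from a Hessian-vector product or finite differences of the gradient):

```python
import numpy as np

# Curvature along a random unit direction u is u^T H u, and its expectation
# over the sphere is tr(H)/n -- so averaging sampled curvatures recovers the
# trace up to a known constant.
rng = np.random.default_rng(1)
n = 200
H = np.diag(rng.uniform(0.0, 5.0, size=n))

curvatures = []
for _ in range(2000):
    u = rng.normal(size=n)
    u /= np.linalg.norm(u)
    curvatures.append(u @ H @ u)   # second derivative in direction u

print(n * np.mean(curvatures))     # close to tr(H)
print(np.trace(H))
```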
Most (all?) of the flatness and volume measures I have seen in the literature are actually tracking the trace. There is one (Keskar et al.)[7] which seems to be correcting in the wrong direction (increasing the influence of large eigenvalues relative to the trace, when it should be doing the opposite).[8] There is another which samples the ellipsoid radius in random directions and calculates the volume of an ellipsoid slice in that direction (which is proportional to $r^n$). While this is technically an unbiased estimator for finite ellipsoids, it has two problems in practice:[9]
The ellipsoid is usually actually infinite, which means the method is sampling to estimate an infinite quantity. (Predictably, the median estimate goes up without bound as we increase the number of samples.)
There are far too few samples to get a good estimate of the determinant, and the thing which is tracked in practice is quite trace-like.
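A toy simulation (my own construction) of the direction-sampling estimator on a finite but ill-conditioned ellipsoid illustrates the second problem: the typical sample wildly underestimates the volume, and the sample mean is carried by rare huge draws:

```python
import numpy as np
from scipy.special import gamma

# Toy demo of the direction-sampling estimator: draw a direction u, find the
# ellipsoid radius r(u) = sqrt(2T / (u^T H u)), and treat V_n * r(u)^n as a
# one-sample volume estimate. This is unbiased for finite ellipsoids, but the
# distribution of r^n is so heavy-tailed that typical samples tell you little.
rng = np.random.default_rng(2)
n, T = 20, 1.0
eigs = np.logspace(-3, 1, n)          # ill-conditioned spectrum
H = np.diag(eigs)
V_n = np.pi ** (n / 2) / gamma(n / 2 + 1)

def slice_volume_sample():
    u = rng.normal(size=n)
    u /= np.linalg.norm(u)
    r = np.sqrt(2 * T / (u @ H @ u))
    return V_n * r ** n

true_volume = V_n * (2 * T) ** (n / 2) / np.sqrt(np.prod(eigs))
estimates = np.array([slice_volume_sample() for _ in range(10000)])
print(np.median(estimates) / true_volume)  # many orders of magnitude too small
print(np.mean(estimates) / true_volume)    # dominated by a few huge samples
```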
Information theory
How many bits does it take to specify (locate) a loss basin?
The simplest answer is $-\log_2(V)$, where $V$ is the initialization-weighted volume of the basin. The weighting is done such that it integrates to 1.
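For a toy Gaussian initialization this is easy to sanity-check by Monte Carlo (dimensions and Hessian of my choosing): the initialization-weighted volume is just the probability that a random init lands inside the basin.

```python
import numpy as np

# The initialization-weighted volume of a basin is the probability that a
# random init theta ~ N(0, sigma^2 I) lands inside it, so the bits needed to
# specify the basin are -log2 of that probability. Toy 5-parameter paraboloid
# basin, estimated by Monte Carlo.
rng = np.random.default_rng(3)
n, T, sigma = 5, 0.1, 1.0
hess_eigs = np.array([4.0, 3.0, 2.0, 1.0, 0.5])

theta = rng.normal(scale=sigma, size=(200_000, n))
loss = 0.5 * np.sum(hess_eigs * theta ** 2, axis=1)   # paraboloid loss
p_in_basin = np.mean(loss < T)
bits = -np.log2(p_in_basin)
print(bits)  # roughly 10-13 bits for this toy basin
```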
Note that this model is nowhere close to perfect, and also isn’t computationally tractable for large networks without further tricks/approximations.
Having a threshold isn’t necessarily desirable or standard, but it makes it easier to model.
This condition basically never happens for DNNs; we’ll see how to fix this in the next section.
I think explicitly calculating the eigenvalues and eigenvectors is $O(n^3)$.
This only works well if $(\lambda + c)$ is significantly larger than the resolution of the stochastic Lanczos quadrature.
Warning: The math is very hard to understand. I think library implementations exist online; I have not used them though. If you try implementing it yourself, it will probably be a massive pain.
This paper is widely cited and generally very good.
The determinant is a product, so it is more sensitive to small eigenvalues than the trace.
I have confirmed with simulations that this estimator is flawed for very large $n$. Doing the equivalent of our $(\lambda + c) I_n$ correction fixes the first issue but not the second.
Summary of the first two sections: You can approximate the loss as a paraboloid, which gives you an ellipsoid as the basin. The eigenvalues of the Hessian of loss give you the curvatures. The volume of the ellipsoid is proportional to $1/\sqrt{\det[\text{Hessian}]}$ (recall that determinant = product of eigenvalues). This doesn't actually work because the eigenvalues can be zero. You can fix this by adding a constant to every eigenvalue.