The Hessian is just a multi-dimensional second derivative, basically. So a zero eigenvalue is a direction along which the second derivative is zero (flatter-bottomed than a parabola).
So the problem is that estimating basin size this way will return spurious infinities, not zeros.
Thanks for your response! I'm not sure I communicated what I meant well, so let me be a bit more concrete. Suppose our loss is the parabolic \(L:\mathbb{R}^3\to\mathbb{R}\) with \(L(x)=x_1^2+x_2^2\). This is like a 2D parabola (but its convex hull / the volume below a given threshold is 3D). In 4D space, which is where the graph of this function lives and hence where I believe we are talking about basin volume, this has 0 volume. The Hessian is the matrix:
\[
H = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix}
\]
This is conveniently already diagonal, and the 0 eigenvalue comes from the component \(x_3\), which is being ignored. My approach is to remove the 0-eigenspace, so that we work only in the subspace where the eigenvalues are positive, leaving just the matrix \(\begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}\), after which we can apply the formula given in the post:
\[
V_{\text{basin}} = \frac{V_n \,(2T)^{n/2}}{\sqrt{\det[\text{Hessian}]}}
\]
If this determinant were 0, then dividing by 0 would give the spurious infinity (this is what you are talking about, right?). But if we remove the 0-eigenspace, we are left with a positive volume, and hence avoid the division by 0.
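The truncation I'm describing can be sketched numerically. A minimal sketch, assuming the loss is quadratic (so the sub-level set \(L \le T\) is an ellipsoid); the cutoff `tol` is an arbitrary choice for deciding which eigenvalues count as zero:

```python
import math
import numpy as np

def ball_volume(n):
    # Volume of the unit n-ball: pi^(n/2) / Gamma(n/2 + 1)
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def basin_volume_truncated(H, T, tol=1e-10):
    # Drop the 0-eigenspace: keep only eigenvalues above tol, then apply
    # V = V_k * (2T)^(k/2) / sqrt(det of restricted Hessian) in k dims.
    eigvals = np.linalg.eigvalsh(H)
    positive = eigvals[eigvals > tol]
    k = len(positive)
    return ball_volume(k) * (2 * T) ** (k / 2) / math.sqrt(np.prod(positive))

H = np.diag([2.0, 2.0, 0.0])  # Hessian of L(x) = x1^2 + x2^2
print(basin_volume_truncated(H, T=1.0))  # disk x1^2 + x2^2 <= 1 has area pi
```

With the zero eigenvalue removed, the formula is applied in the 2D subspace and returns the area \(\pi T\) of the disk rather than dividing by zero.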
The loss is defined over all dimensions of parameter space, so \(L(x)=x_1^2+x_2^2\) is still a function of all 3 coordinates. You should think of it as \(L(x)=x_1^2+x_2^2+0\cdot x_3^2\). Its thickness in the \(x_3\) direction is infinite, not zero.
Here's what a zero-determinant Hessian corresponds to: the basin is not lower dimensional; it is just infinite in some dimension. The simplest way to fix this is to replace the infinity with some large value. Luckily, there is a fairly principled way to do this:
Regularization / weight decay provides actual curvature, which should be added to the loss; doing this is the same as adding \(\lambda I_n\) to the Hessian.
The scale of the initialization distribution provides a natural scale for how much volume an infinite sweep should count as (very roughly, the volume only matters if it overlaps with the initialization distribution, and the distance of sweep for which this is true is on the order of \(\sigma\), the standard deviation of the initialization).
So \((\lambda + k\sigma^{-2})\,I_n\) is a fairly principled correction, and much better than just "throwing out" the other dimensions. "Throwing out" dimensions is unprincipled, dimensionally incorrect, numerically problematic, and should give worse results.
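That correction can be sketched as follows; the values of \(\lambda\), \(\sigma\), and the fudge factor \(k\) here are illustrative placeholders, not values from the post:

```python
import math
import numpy as np

def basin_volume_corrected(H, T, lam=0.01, sigma=1.0, k=1.0):
    # Add the (lambda + k * sigma^-2) * I_n floor to the Hessian, so flat
    # directions get a finite width on the order of sigma instead of infinity.
    n = H.shape[0]
    H_eff = H + (lam + k / sigma ** 2) * np.eye(n)
    V_n = math.pi ** (n / 2) / math.gamma(n / 2 + 1)  # unit n-ball volume
    return V_n * (2 * T) ** (n / 2) / math.sqrt(np.linalg.det(H_eff))

H = np.diag([2.0, 2.0, 0.0])  # singular Hessian: flat along x3
print(basin_volume_corrected(H, T=1.0))  # finite, unlike the raw formula
```

Note that the volume now stays in the full n-dimensional space, so comparisons between basins of different flatness remain dimensionally consistent.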
Note that this is equivalent to replacing "size 1/0" with "size 1". Issues with this become apparent if the scale of your system is much smaller or larger than 1. A better try might be to replace the 0 with the average of the other eigenvalues, times a fudge factor. But that is still quite unprincipled; it may be better to look at higher derivatives first, or to do nonlocal numerical estimation as described in the post.
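The average-eigenvalue patch might look like this; the fudge factor `c` and cutoff `tol` are arbitrary choices, not from the post:

```python
import numpy as np

def patch_zero_eigenvalues(H, c=1.0, tol=1e-10):
    # Replace near-zero eigenvalues with c times the mean of the rest,
    # then rebuild the Hessian in its original eigenbasis.
    w, V = np.linalg.eigh(H)
    nonzero = w[w > tol]
    fill = c * nonzero.mean() if len(nonzero) else 1.0
    w_patched = np.where(w > tol, w, fill)
    return V @ np.diag(w_patched) @ V.T

H = np.diag([2.0, 2.0, 0.0])
H_patched = patch_zero_eigenvalues(H)
print(np.linalg.det(H_patched))  # ~8: the zero eigenvalue became the mean, 2
```

This keeps the determinant finite, but as noted above the result scales with the fudge factor, which is exactly the unprincipled part.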