The loss is defined over all dimensions of parameter space, so $L(x) = x_1^2 + x_2^2$ is still a function of all three $x$'s. You should think of it as $L(x) = x_1^2 + x_2^2 + 0 \cdot x_3^2$. Its thickness in the $x_3$ direction is infinite, not zero.
Here’s what a zero-determinant Hessian corresponds to:
The basin here is not lower dimensional; it is just infinite in some dimension. The simplest way to fix this is to replace the infinity with some large value. Luckily, there is a fairly principled way to do this:
Regularization / weight decay provides actual curvature, which should be added to the loss; doing this is the same as adding $\lambda I_n$ to the Hessian.
The scale of the initialization distribution provides a natural scale for how much volume an infinite sweep should count as. Very roughly, the volume only matters where it overlaps with the initialization distribution, and the sweep distance over which this holds is on the order of $\sigma$, the standard deviation of the initialization.
So adding $(\lambda + k/\sigma^2) I_n$ to the Hessian is a fairly principled correction, and much better than just “throwing out” the other dimensions. “Throwing out” dimensions is unprincipled, dimensionally incorrect, numerically problematic, and should give worse results.
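To make the correction concrete, here is a minimal NumPy sketch for the three-parameter example. It assumes the usual quadratic approximation in which basin volume scales as $\det(H)^{-1/2}$, and a weight-decay penalty of the form $(\lambda/2)\lVert x \rVert^2$, so that it contributes exactly $\lambda I_n$ to the Hessian. The helper name `log_basin_volume` and the values chosen for `lam`, `sigma`, and `k` are illustrative assumptions, not taken from the discussion above.

```python
import numpy as np

# Hessian of L(x) = x1^2 + x2^2, viewed as a function of (x1, x2, x3).
# The x3 direction has zero curvature, so det(H) = 0.
H = np.diag([2.0, 2.0, 0.0])

# Illustrative values only (assumptions for this sketch).
lam = 1e-3    # weight-decay strength, for a penalty (lam / 2) * ||x||^2
sigma = 0.5   # standard deviation of the initialization distribution
k = 1.0       # order-one constant multiplying 1 / sigma^2


def log_basin_volume(hessian, shift=0.0):
    """Log-volume proxy -0.5 * log det(H + shift * I), up to an additive constant."""
    n = hessian.shape[0]
    sign, logdet = np.linalg.slogdet(hessian + shift * np.eye(n))
    if sign <= 0:
        # Singular Hessian: the basin is infinitely wide in some direction.
        return float("inf")
    return -0.5 * logdet


raw = log_basin_volume(H)                                  # infinite, not zero
corrected = log_basin_volume(H, shift=lam + k / sigma**2)  # finite
print(raw, corrected)
```

With the shift applied, the flat direction gets curvature $\lambda + k/\sigma^2$ instead of zero, so it contributes a large but finite factor to the volume rather than being dropped.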