You may find “On the Covariance-Hessian Relation in Evolution Strategies” interesting:
https://arxiv.org/pdf/1806.03674
It makes a lot of assumptions, but as I understand it if you:
a. Sample points near the minimum [1].
b. Select only the lowest-loss point from that sample and save it.
c. Repeat that process many times.
d. Create a covariance matrix of the selected points.
Then the covariance matrix will converge to the inverse of the Hessian, assuming the loss landscape is quadratic. Since the inverse of a matrix has the same rank as the matrix itself, you could probably just use this covariance matrix to bound the local learning coefficient.
Though since a covariance matrix built from n sample points has rank at most n − 1, you would need to sample and evaluate roughly d/2 points (with d the number of parameters). The process seems pretty parallelizable though (rough sketch below).
[1] Specifically, using an isotropic, unit-variance normal distribution centered at the minimum.
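To make that concrete, here is a rough sketch of the procedure on a toy quadratic loss (the specific Hessian, batch size, and number of repeats below are just illustrative choices of mine, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 10                                     # parameter dimension
H = np.diag(np.arange(1.0, d + 1.0))       # toy Hessian at the minimum (made up for illustration)
minimum = np.zeros(d)

def loss(w):
    # Quadratic loss: 0.5 * (w - minimum)^T H (w - minimum)
    delta = w - minimum
    return 0.5 * delta @ H @ delta

selected = []
for _ in range(5000):                      # c. repeat the sample-and-select step many times
    # a. isotropic, unit-variance samples centered at the minimum
    batch = minimum + rng.standard_normal((32, d))
    losses = np.array([loss(w) for w in batch])
    # b. keep only the lowest-loss point from this batch
    selected.append(batch[np.argmin(losses)])

# d. covariance matrix of the selected points
C = np.cov(np.array(selected), rowvar=False)

# If the claim holds, C should be roughly proportional to H^{-1},
# so its spectrum (and in particular its rank) should mirror the inverse Hessian's.
print(np.round(np.linalg.eigvalsh(C), 3))
print(np.round(np.linalg.eigvalsh(np.linalg.inv(H)), 3))
```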
Why would we want or need to do this, instead of just calculating the top/bottom Hessian eigenvalues?
Isn’t calculating the Hessian for large statistical models kind of hard? And aren’t second derivatives prone to numerical errors?
Agree that this is only valuable if sampling on the loss landscape is easier or more robust than calculating the Hessian.
Getting the Hessian eigenvalues does not require calculating the full Hessian. You can use Hessian-vector products, built out of the Jacobian-vector product machinery in e.g. JAX; the Hessian itself never has to be explicitly represented in memory.
And even assuming the estimator for the Hessian pseudoinverse is cheap and precise, you’d still need to get its rank anyway, which would by default be just as expensive as getting the rank of the Hessian.
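For concreteness, here is roughly what I mean, sketched in JAX with scipy’s Lanczos solver (the toy loss is made up, and this particular solver is just one way to do it):

```python
import jax
import jax.numpy as jnp
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def loss(w):
    # Stand-in loss; in practice this would be the model's training loss.
    a = jnp.arange(1.0, w.shape[0] + 1.0)
    return jnp.sum(a * jnp.sin(w) ** 2) + 0.1 * jnp.sum(w ** 4)

w0 = jnp.ones(500)          # the point (e.g. trained parameters) whose Hessian spectrum we want
d = w0.shape[0]

def hvp(v):
    # Hessian-vector product via forward-over-reverse: jvp of grad(loss).
    # The full d x d Hessian is never materialized.
    v = jnp.asarray(v, dtype=w0.dtype)
    return np.asarray(jax.jvp(jax.grad(loss), (w0,), (v,))[1])

H_op = LinearOperator((d, d), matvec=hvp)

# Lanczos iterations recover the extreme eigenvalues from matrix-vector products alone.
top = eigsh(H_op, k=5, which="LA", return_eigenvectors=False)
bottom = eigsh(H_op, k=5, which="SA", return_eigenvectors=False)
print("largest eigenvalues: ", np.sort(top)[::-1])
print("smallest eigenvalues:", np.sort(bottom))
```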
That makes sense, I guess it just comes down to an empirical question of which is easier.
Question about what you said earlier: How can you use the top/bottom eigenvalues to estimate the rank of the Hessian? I’m not as familiar with this so any pointers would be appreciated!
The rank of a matrix = the number of non-zero eigenvalues of the matrix (at least for symmetric matrices like the Hessian)! So you can either compute the top eigenvalues and count how many are non-zero, or use the fact that an n×n matrix always has n eigenvalues (counted with multiplicity) and count the zero eigenvalues at the bottom of the spectrum to work out how many non-zero ones are left.
Also, for more detail on the “getting Hessian eigenvalues without calculating the full Hessian” thing, I’d really recommend John’s explanation in this linear algebra lecture he recorded.
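And just to illustrate the counting itself (the matrix and the tolerance here are made up; choosing the cutoff for “numerically zero” is the fiddly part in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric 6x6 matrix constructed to have rank 3.
A = rng.standard_normal((6, 3))
H = A @ A.T

eigvals = np.linalg.eigvalsh(H)            # all 6 eigenvalues, in ascending order
tol = 1e-10 * np.max(np.abs(eigvals))      # anything below this counts as numerically zero

rank_from_top = int(np.sum(np.abs(eigvals) > tol))                       # count the non-zeros
rank_from_bottom = len(eigvals) - int(np.sum(np.abs(eigvals) <= tol))    # n minus the zeros

print(eigvals)
print(rank_from_top, rank_from_bottom)     # both should be 3
```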
Thanks for this! I misinterpreted Lucius as saying “use the single highest and single lowest eigenvalues to estimate the rank of a matrix” which I didn’t think was possible.
Counting the number of non-zero eigenvalues makes a lot more sense!