LawrenceC comments on Performing an SVD on a time-series matrix of gradient updates on an MNIST network produces 92.5 singular values

LawrenceC 21 Dec 2022 1:43 UTC
2 points
1. As the optimization process proceeds, the updates will get smaller. Is it possible that (roughly speaking) the low-dimensional space you’re seeing is “just” the space of update vectors from early in the process? (Toy example: suppose we have a 1000-dimensional space and the nth update is in the direction of the nth basis vector and has magnitude 1/n, and we do 1000 update steps. Then the matrix we’re SVDing is diagonal, the SVD will look like identity . diagonal . identity, and the graph of singular values will look not entirely unlike the graphs you’ve shown.)
It’s definitely the case that including earlier updates leads to different singular vectors than if you exclude them. But it’s not clear whether you should care about the earlier updates vs the later ones!
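For concreteness, here is a rough NumPy sketch of the toy example from the quoted question. It is not the code from the original post. The names `dim`, `skip`, and `updates` are made up for illustration, and dropping the first 100 updates is an arbitrary choice. It checks that the singular values are exactly the update magnitudes 1/n, and that the top singular vectors shift when the early updates are left out.

```python
import numpy as np

dim = 1000   # hypothetical: dimension of the space and number of update steps
skip = 100   # hypothetical: how many early updates to drop for the comparison

# Row n (1-indexed) is the nth update: direction e_n, magnitude 1/n.
magnitudes = 1.0 / np.arange(1, dim + 1)
updates = np.diag(magnitudes)

# SVD over all updates. The matrix is diagonal, so the singular values
# are just the magnitudes, already in descending order.
_, s_all, vt_all = np.linalg.svd(updates)

# For each of the top 10 right singular vectors, find the basis
# direction it points along (largest absolute component).
top_dirs_all = np.argmax(np.abs(vt_all[:10]), axis=1)

# Same analysis with the first `skip` updates excluded.
_, s_late, vt_late = np.linalg.svd(updates[skip:], full_matrices=False)
top_dirs_late = np.argmax(np.abs(vt_late[:10]), axis=1)

print("top directions, all updates: ", top_dirs_all)
print("top directions, late updates:", top_dirs_late)
```

In this toy setup the leading singular vectors are simply whichever updates came first in the window you kept, so which ones count as "top" depends entirely on where you start the window.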