So, just to check I understand:
At any given point in the optimization process, we have a model mapping an input image to (in this case) a digit classification. We also have a big pile of test data with ground-truth classifications, so we can compute some measure of how close the model is to confidently classifying every test case correctly. And we can calculate the gradient of that measure w.r.t. the model’s parameters, which tells us (1) the direction in which a small update would improve the model, and (2) the direction in which we actually do make a small update at the next stage of the training process.
And you’ve taken all those gradient vectors and found, roughly speaking, that they all come close to lying in a somewhat lower-dimensional space than that of “all possible gradient vectors”.
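(For concreteness, here’s a minimal sketch of the kind of computation I understand you to be describing, in NumPy. The array name `updates`, the idea that each per-step update has been flattened into one row, and the 99% cutoff are my assumptions for illustration, not details from your experiment.)

```python
import numpy as np

# Assume each row of `updates` is one flattened parameter-update vector,
# so `updates` has shape (num_steps, num_params).
updates = np.load("updates.npy")  # placeholder for however the updates were logged

# Economy-size SVD of the stacked updates.
U, S, Vt = np.linalg.svd(updates, full_matrices=False)

# If most of the squared singular-value mass sits in the first few values,
# the updates approximately lie in the low-dimensional subspace spanned by
# the corresponding rows of Vt.
energy = np.cumsum(S**2) / np.sum(S**2)
k = int(np.searchsorted(energy, 0.99)) + 1
print(f"top {k} directions carry ~99% of the squared singular-value mass")
```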
Some ignorant questions. (I am far from being an SVD expert.)
As the optimization process proceeds, the updates will get smaller. Is it possible that (roughly speaking) the low-dimensional space you’re seeing is “just” the space of update vectors from early in the process? (Toy example: suppose we have a 1000-dimensional space and the nth update is in the direction of the nth basis vector and has magnitude 1/n, and we do 1000 update steps. Then the matrix we’re SVDing is diagonal, the SVD will look like identity . diagonal . identity, and the graph of singular values will look not entirely unlike the graphs you’ve shown.)
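(The toy example is easy to check numerically; this just reproduces the diagonal case described above.)

```python
import numpy as np

# 1000 update steps: the nth step is along the nth basis vector with magnitude 1/n,
# so the stacked update matrix is diag(1, 1/2, ..., 1/1000).
n = 1000
updates = np.diag(1.0 / np.arange(1, n + 1))

# For a diagonal matrix with decreasing positive entries, the SVD really is
# identity . diagonal . identity, so the singular values are just 1, 1/2, 1/3, ...
S = np.linalg.svd(updates, compute_uv=False)
print(S[:5])                   # [1.  0.5  0.333...  0.25  0.2]
print(S[:10].sum() / S.sum())  # ~0.39: a 1/n falloff concentrates a lot of mass early
```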
Have you tried doing similar things with other highish-dimensional optimization processes, and seeing whether they produce similar results or different ones? (If they produce similar results, then probably what you’re seeing is a consequence of some general property of such optimization processes. If they produce very different results, then it’s more likely that what you’re seeing is specific to the specific process you’re looking at.)
It’s definitely the case that including earlier updates leads to different singular vectors than if you exclude them. But it’s not clear whether you should care about the earlier updates vs the later ones!
Oh yeah, I just remembered I had a way to figure out whether we’re actually getting a good approximation from our cutoff: use the low-rank approximation of the gradient-update matrix induced by the cutoff as your gradients, then look at the loss of the resulting alternative model.
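(A sketch of that check, assuming the logged rows of `updates` are the parameter deltas that were actually applied, so “replaying” them just means adding them back onto the initial parameters. `theta0`, `updates`, `k`, and `eval_loss` are placeholders for whatever the real training code calls these.)

```python
import numpy as np

def rank_k_replay_loss(theta0, updates, k, eval_loss):
    """Replace the logged updates by their best rank-k approximation,
    re-apply them to the initial parameters, and evaluate the result."""
    U, S, Vt = np.linalg.svd(updates, full_matrices=False)
    updates_k = (U[:, :k] * S[:k]) @ Vt[:k, :]   # rank-k reconstruction of the update matrix
    theta_alt = theta0 + updates_k.sum(axis=0)   # net effect of replaying the approximate updates
    return eval_loss(theta_alt)

# If this comes out close to the loss of the actually-trained model, the cutoff
# is keeping the directions that mattered.
```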
This is a good hypothesis, and it seems like it can be checked by removing the first however-many timesteps from the SVD calculation.
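(That check is just a slice before the SVD; `n_skip` is a placeholder for however many early steps you drop, and `updates` is the same stacked matrix as in the sketch above.)

```python
# Drop the first n_skip update vectors before the SVD, so the singular
# directions reflect only the later part of training.
n_skip = 1000  # placeholder
U, S, Vt = np.linalg.svd(updates[n_skip:], full_matrices=False)
```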
I have! I tried the same thing on a simpler network trained on an algorithmic task, and got similar results. In that case I got 10 singular vectors from 8M(?) time steps.