A mundane explanation of what’s happening: We know from the NTK literature that, to a (very) first approximation, SGD only affects the weights in the final layer of fully connected networks. So we should expect the first layer to have a larger norm than preceding layers. It would not be too surprising if this were distributed exponentially, since, running a simple simulation where
\[ w_i(x) \;=\; \sum_{n=5-i}^{9} \frac{\sqrt{x}^{\,n}}{n!} \]
for $i = 1, \dots, 5$ and where $x$ is the number of gradient steps, we get the graph
and, looking at the weight distribution at a given time-step, this seems to be distributed exponentially.
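For concreteness, here is a minimal sketch of the toy model as I read it, using the reconstructed formula above (the step range, resolution, and plotting details are my own assumptions, not the original setup):

```python
# Minimal sketch of the toy simulation: one curve per "layer" i = 1..5,
# w_i(x) = sum_{n=5-i}^{9} sqrt(x)^n / n!, plotted against gradient steps x.
from math import factorial

import numpy as np
import matplotlib.pyplot as plt

steps = np.linspace(0.0, 100.0, 500)    # x: number of gradient steps (assumed range)

for i in range(1, 6):                   # i = 1, ..., 5
    # Truncated series in sqrt(x); curves for larger i start at lower n.
    terms = [np.sqrt(steps) ** n / factorial(n) for n in range(5 - i, 10)]
    w = np.sum(terms, axis=0)
    plt.plot(steps, w, label=f"i = {i}")

plt.xlabel("gradient steps x")
plt.ylabel("w_i(x)")
plt.legend()
plt.show()
```

Reading off the five values of $w_i$ at the final step gives the “weight distribution at a given time-step” referred to above.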
Huh, thanks for this pointer! I had not read about NTK (Neural Tangent Kernel) before. What I understand you to be saying is something like: SGD mainly affects the weights in the last layer, and the propagation down to each earlier layer is weakened by a factor, creating the exponential behaviour? This seems somewhat plausible, though I don’t know enough about NTK to make a stronger statement.
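To spell out how I picture the “weakened by a factor” story (my own sketch, not a result I can point to in the NTK literature; $\gamma$ and $\Delta W$ are just my notation):

```latex
% Sketch: if the update reaching each successively earlier layer is attenuated
% by a roughly constant factor gamma in (0, 1), then the accumulated weight
% change k layers below the output layer L scales like gamma^k, i.e. it falls
% off exponentially with depth.
\[
  \lVert \Delta W_{L-k} \rVert \;\approx\; \gamma^{\,k}\,\lVert \Delta W_{L} \rVert,
  \qquad 0 < \gamma < 1 .
\]
```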
I don’t understand the simulation you ran (I’m not familiar with that equation; is this a common thing to do?), but are you saying the y-levels of the 5 lines (simulating 5 layers) at the last time-step (finished training) should be exponentially increasing, from violet to red, green, orange, and blue? It doesn’t look exponential by eye. Or are you thinking of the value as a function of x (training time)?
I do appreciate your comment, and the push to look for mundane explanations, though! This seems like the kind of thing where I would later say “Oh, of course.”
You’re right, that’s not an exponential. I was wrong. That said, I don’t trust my toy model enough to be convinced my overall point is wrong. Unfortunately I don’t have the time this week to run something more in-depth.
You mean the final layer?
Yes.