A mundane explanation of what’s happening: We know from the NTK literature that, to a (very) first approximation, SGD only affects the weights in the final layer of fully connected networks. So we should expect the first layer to have a larger norm than preceding layers. It would not be too surprising if this were distributed exponentially, since, running a simple simulation where
\[ w_i(x) \;=\; \sum_{n=5-i}^{9} \frac{\sqrt{x}^{\,n}}{n!} \]
for $i = 1, \dots, 5$ and where $x$ is the number of gradient steps, we get the graph
and, looking at the weight distribution at a given time-step, this seems to be distributed exponentially.
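For concreteness, here is a minimal sketch of the toy model as I read it, using the reconstructed formula above (the step range, resolution, and plotting details are my own assumptions, not the original setup):

```python
# Minimal sketch of the toy simulation: one curve per "layer" i = 1..5,
# w_i(x) = sum_{n=5-i}^{9} sqrt(x)^n / n!, plotted against gradient steps x.
from math import factorial

import numpy as np
import matplotlib.pyplot as plt

steps = np.linspace(0.0, 100.0, 500)    # x: number of gradient steps (assumed range)

for i in range(1, 6):                   # i = 1, ..., 5
    # Truncated series in sqrt(x); curves for larger i start at lower n.
    terms = [np.sqrt(steps) ** n / factorial(n) for n in range(5 - i, 10)]
    w = np.sum(terms, axis=0)
    plt.plot(steps, w, label=f"i = {i}")

plt.xlabel("gradient steps x")
plt.ylabel("w_i(x)")
plt.legend()
plt.show()
```

Reading off the five values of $w_i$ at the final step gives the “weight distribution at a given time-step” referred to above.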
Huh, thanks for this pointer! I had not read about NTK (Neural Tangent Kernel) before. What I understand you to be saying is something like: SGD mainly affects the weights in the last layer, and the propagation down to each earlier layer is weakened by a factor, creating the exponential behaviour? This seems somewhat plausible, though I don’t know enough about NTK to make a stronger statement.
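To spell out how I picture the “weakened by a factor” story (my own sketch, not a result I can point to in the NTK literature; $\gamma$ and $\Delta W$ are just my notation):

```latex
% Sketch: if the update reaching each successively earlier layer is attenuated
% by a roughly constant factor gamma in (0, 1), then the accumulated weight
% change k layers below the output layer L scales like gamma^k, i.e. it falls
% off exponentially with depth.
\[
  \lVert \Delta W_{L-k} \rVert \;\approx\; \gamma^{\,k}\,\lVert \Delta W_{L} \rVert,
  \qquad 0 < \gamma < 1 .
\]
```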
I don’t understand the simulation you ran (I’m not familiar with that equation; is this a common thing to do?), but are you saying the y-levels of the 5 lines (simulating 5 layers) at the last time-step (finished training) should be exponentially increasing, from violet to red, green, orange, and blue? It doesn’t look exponential by eye. Or are you thinking of the value as a function of x (training time)?
I do appreciate your comment, and the push to look for mundane explanations, though! This seems like the kind of thing where I would later say “Oh, of course.”
You’re right, that’s not an exponential. I was wrong. That said, I don’t trust my toy model enough to be convinced my overall point is wrong. Unfortunately I don’t have the time this week to run something more in-depth.
You mean the final layer?
Yes.