Due to LayerNorm, it’s hard to cancel out existing residual stream features, but easy to overshadow existing features by just making new features 4.5% larger.
If I’m interpreting this correctly, then it sounds like the network is learning exponentially larger weights in order to compensate for an exponentially growing residual stream. However, I’m still not quite clear on why LayerNorm doesn’t take care of this.
To avoid this phenomenon, one idea that springs to mind is to adjust how the residual stream operates. For a neural network module f, the residual stream works by creating a combined output:
r(x)=f(x)+x
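For concreteness, here is a minimal sketch of such a residual connection (PyTorch is my choice here; the thread does not specify a framework), wrapping an arbitrary sublayer f such as attention or an MLP:

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Standard residual connection: r(x) = f(x) + x."""

    def __init__(self, f: nn.Module):
        super().__init__()
        self.f = f  # any sublayer, e.g. attention or an MLP

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The sublayer's output is added back onto the untouched skip path.
        return self.f(x) + x


# Usage: wrap a toy sublayer and apply it to a batch of vectors.
block = ResidualBlock(nn.Linear(16, 16))
out = block(torch.randn(4, 16))
```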
You seem to suggest that the model essentially amplifies the features within the neural network in order to overcome the large residual stream:
r(x)=f(1.045*x)+x
However, what if, instead of adding the input directly, it were first rescaled by a compensatory weight:
r(x)=f(x)+(1/1.045)x≈f(x)+0.957x
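Here is a minimal sketch of this proposed variant, under the assumption that the 4.5% figure is treated as a fixed per-layer growth factor (my framing, not necessarily the original intent):

```python
import torch
import torch.nn as nn


class RescaledResidualBlock(nn.Module):
    """Proposed variant: r(x) = f(x) + (1/1.045) * x.

    The skip path is shrunk by a fixed compensatory factor, so f no longer
    needs to enlarge its output to overshadow the incoming stream.
    (Treating 1.045 as a constant per-layer factor is an assumption here.)
    """

    def __init__(self, f: nn.Module, growth_per_layer: float = 1.045):
        super().__init__()
        self.f = f
        self.skip_scale = 1.0 / growth_per_layer  # ~= 0.957

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.f(x) + self.skip_scale * x
```

Stacking L such blocks multiplies the skip-path contribution of the original input by roughly 0.957^L, which is the sense in which f would no longer be rewarded for producing ever-larger features.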
It seems to me that this would disincentivize f from learning the exponentially growing feature scales. Based on your experience, would you expect this to eliminate the exponential growth in the norm across layers? Why or why not?
I understand the network’s “intention” the other way around: I think the network wants to have an exponentially growing residual stream, and it increases its weights exponentially in order to get one.
And our speculation for why the model would want this is our “favored explanation” mentioned above.
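As a rough numerical illustration of this reading (the 4.5% figure is borrowed from the quoted sentence purely as a stand-in growth rate, and a 48-layer depth such as GPT-2 XL's is just an example): if each block leaves the stream about 4.5% larger than it found it, the norm compounds multiplicatively,
norm after L blocks ≈ 1.045^L × initial norm
which for L=48 gives 1.045^48 ≈ 8.3, i.e. roughly an 8× larger stream by the final layer.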