Due to LayerNorm, it’s hard to cancel out existing residual stream features, but easy to overshadow existing features by just making new features 4.5% larger.
If I’m interpreting this correctly, then it sounds like the network is learning exponentially larger weights in order to compensate for an exponentially growing residual stream. However, I’m still not quite clear on why LayerNorm doesn’t take care of this.
To avoid this phenomenon, one idea that springs to mind is to adjust how the residual stream operates. For a neural network module f, the residual stream works by creating a combined output:
r(x)=f(x)+x
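For concreteness, here is a minimal sketch of such a residual connection (PyTorch is my choice here; the thread does not specify a framework), wrapping an arbitrary sublayer f such as attention or an MLP:

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Standard residual connection: r(x) = f(x) + x."""

    def __init__(self, f: nn.Module):
        super().__init__()
        self.f = f  # any sublayer, e.g. attention or an MLP

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The sublayer's output is added back onto the untouched skip path.
        return self.f(x) + x


# Usage: wrap a toy sublayer and apply it to a batch of vectors.
block = ResidualBlock(nn.Linear(16, 16))
out = block(torch.randn(4, 16))
```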
You seem to suggest that the model essentially amplifies the features within the neural network in order to overcome the large residual stream:
r(x)=f(1.045*x)+x
However, what if, instead of adding the input directly, it were first rescaled by a compensatory weight:
r(x)=f(x)+(1/1.045)x≈f(x)+0.957x
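Here is a minimal sketch of this proposed variant, under the assumption that the 4.5% figure is treated as a fixed per-layer growth factor (my framing, not necessarily the original intent):

```python
import torch
import torch.nn as nn


class RescaledResidualBlock(nn.Module):
    """Proposed variant: r(x) = f(x) + (1/1.045) * x.

    The skip path is shrunk by a fixed compensatory factor, so f no longer
    needs to enlarge its output to overshadow the incoming stream.
    (Treating 1.045 as a constant per-layer factor is an assumption here.)
    """

    def __init__(self, f: nn.Module, growth_per_layer: float = 1.045):
        super().__init__()
        self.f = f
        self.skip_scale = 1.0 / growth_per_layer  # ~= 0.957

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.f(x) + self.skip_scale * x
```

Stacking L such blocks multiplies the skip-path contribution of the original input by roughly 0.957^L, which is the sense in which f would no longer be rewarded for producing ever-larger features.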
It seems to me that this would disincentivize f from learning the exponentially growing feature scales. Based on your experience, would you expect this to eliminate the exponential growth in the norm across layers? Why or why not?
I understand the network’s “intention” the other way around: I think the network wants to have an exponentially growing residual stream, and it increases its weights exponentially in order to get one.
And our speculation for why the model would want this is our “favored explanation” mentioned above.
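As a rough numerical illustration of this reading (the 4.5% figure is borrowed from the quoted sentence purely as a stand-in growth rate, and a 48-layer depth such as GPT-2 XL's is just an example): if each block leaves the stream about 4.5% larger than it found it, the norm compounds multiplicatively,
norm after L blocks ≈ 1.045^L × initial norm
which for L=48 gives 1.045^48 ≈ 8.3, i.e. roughly an 8× larger stream by the final layer.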