Was considering saving this for a followup post but it’s relatively self-contained, so here we go.
Why are huge coefficients sometimes okay? Let’s start by looking at the residual-stream norm at each position after injecting a large vector at layer 20.
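For concreteness, here is a minimal sketch of how such a per-position norm plot might be produced with TransformerLens; the prompt, steering vector, positions, layer, and coefficient below are placeholders rather than the exact values used.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-xl")
tokens = model.to_tokens("I went up to my friend and said")  # placeholder prompt

LAYER, COEFF = 20, 1000  # injection layer and (huge) coefficient -- illustrative values
steering = COEFF * torch.randn(model.cfg.d_model)  # stand-in for the real steering vector

def inject(resid, hook):
    # resid: [batch, pos, d_model]; add the big vector at positions 1 & 2
    resid[:, 1:3, :] += steering.to(resid.device)
    return resid

with model.hooks(fwd_hooks=[(f"blocks.{LAYER}.hook_resid_pre", inject)]):
    _, cache = model.run_with_cache(tokens)

# residual-stream norm at each position, for every layer
norms = torch.stack(
    [cache[f"blocks.{l}.hook_resid_post"][0].norm(dim=-1) for l in range(model.cfg.n_layers)]
)
print(norms)  # the injected positions dwarf the rest from layer 20 onwards
```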
This graph is explained by LayerNorm. Before using the residual stream, we perform a LayerNorm:
# transformer block forward() in GPT2
x = x + self.attn(self.ln_1(x))  # attention reads the LayerNorm'd stream
x = x + self.mlp(self.ln_2(x))   # so does the MLP; only the residual addition is unnormalized
If x has very large magnitude, then the block doesn’t change it much relative to its magnitude. Additionally, attention is run on the normalized x, meaning only the “unscaled” version of x is moved between positions.
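To see both points concretely, here’s a toy check in plain PyTorch (the magnitudes are arbitrary illustrative numbers, and `block_out` is just a stand-in for the attn/MLP contributions):

```python
import torch

ln = torch.nn.LayerNorm(768)  # same width as GPT-2's d_model; affine params default to identity
x = torch.randn(768)

# LayerNorm is (essentially) scale-invariant: ln(1000*x) ≈ ln(x),
# so attn/mlp see the same input whether or not the residual stream is huge.
print(torch.allclose(ln(1000 * x), ln(x), atol=1e-4))  # True

# And their outputs stay at "normal" magnitude relative to a huge residual:
block_out = torch.randn(768) * 5        # stand-in for attn(ln_1(x)) + mlp(ln_2(x))
huge_x = 1000 * torch.randn(768)        # residual stream after a big injection
print((block_out.norm() / huge_x.norm()).item())  # ~0.005: the block barely moves it, relatively
```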
As expected, we see a convergence in probability along each token position when we look with the tuned lens.
You can see how for positions 1 & 2 the output distribution is decided at layer 20: because we overwrote the residual stream with a huge coefficient, all the LayerNorm’d outputs we add afterwards are tiny in comparison, so in the final LayerNorm we get ln(bigcoeff*diff + small) ~= ln(bigcoeff*diff) ~= ln(diff) (LayerNorm is scale-invariant, so the huge coefficient drops out).
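Numerically (again with arbitrary stand-in magnitudes; `ln_f` plays the role of the final LayerNorm and `diff` the injected direction):

```python
import torch

ln_f = torch.nn.LayerNorm(768)        # stand-in for GPT-2's final LayerNorm
diff = torch.randn(768)               # stand-in for the injected steering direction
small = torch.randn(768) * 10         # accumulated block outputs after the injection
bigcoeff = 1000.0

# ln(bigcoeff*diff + small) ~= ln(bigcoeff*diff) ~= ln(diff)
lhs = ln_f(bigcoeff * diff + small)
rhs = ln_f(diff)
print(torch.nn.functional.cosine_similarity(lhs, rhs, dim=0).item())  # ≈ 1.0
```

Since the unembedding only ever sees this final LayerNorm output, the logits at those positions are set almost entirely by diff, which is why the distribution looks frozen from layer 20 onwards.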
Thanks for writing this up, I hadn’t realized this. One conclusion I’m drawing is: If the values in the modified residual streams aren’t important to other computations in later sequence positions, then a large-coefficient addition will still lead to reasonable completions.
Yeah, assuming by “not important” you mean “not relevant” (low attention score).