Was considering saving this for a followup post but it’s relatively self-contained, so here we go.
Why are huge coefficients sometimes okay? Let’s start by looking at the residual-stream norm at each position after injecting a large vector at layer 20.
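For concreteness, here is a minimal sketch of how such a per-position norm plot might be produced with TransformerLens; the prompt, steering vector, positions, layer, and coefficient below are placeholders rather than the exact values used.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-xl")
tokens = model.to_tokens("I went up to my friend and said")  # placeholder prompt

LAYER, COEFF = 20, 1000  # injection layer and (huge) coefficient -- illustrative values
steering = COEFF * torch.randn(model.cfg.d_model)  # stand-in for the real steering vector

def inject(resid, hook):
    # resid: [batch, pos, d_model]; add the big vector at positions 1 & 2
    resid[:, 1:3, :] += steering.to(resid.device)
    return resid

with model.hooks(fwd_hooks=[(f"blocks.{LAYER}.hook_resid_pre", inject)]):
    _, cache = model.run_with_cache(tokens)

# residual-stream norm at each position, for every layer
norms = torch.stack(
    [cache[f"blocks.{l}.hook_resid_post"][0].norm(dim=-1) for l in range(model.cfg.n_layers)]
)
print(norms)  # the injected positions dwarf the rest from layer 20 onwards
```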
This graph is explained by LayerNorm. Before using the residual stream, we perform a LayerNorm:
# transformer block forward() in GPT2
x = x + self.attn(self.ln_1(x))  # attention reads the LayerNorm'd stream
x = x + self.mlp(self.ln_2(x))   # so does the MLP; only the residual addition is unnormalized
If x has very large magnitude, then the block doesn’t change it much relative to its magnitude. Additionally, attention is run on the normalized x, meaning only the “unscaled” version of x is moved between positions.
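To see both points concretely, here’s a toy check in plain PyTorch (the magnitudes are arbitrary illustrative numbers, and `block_out` is just a stand-in for the attn/MLP contributions):

```python
import torch

ln = torch.nn.LayerNorm(768)  # same width as GPT-2's d_model; affine params default to identity
x = torch.randn(768)

# LayerNorm is (essentially) scale-invariant: ln(1000*x) ≈ ln(x),
# so attn/mlp see the same input whether or not the residual stream is huge.
print(torch.allclose(ln(1000 * x), ln(x), atol=1e-4))  # True

# And their outputs stay at "normal" magnitude relative to a huge residual:
block_out = torch.randn(768) * 5        # stand-in for attn(ln_1(x)) + mlp(ln_2(x))
huge_x = 1000 * torch.randn(768)        # residual stream after a big injection
print((block_out.norm() / huge_x.norm()).item())  # ~0.005: the block barely moves it, relatively
```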
As expected, we see a convergence in probability along each token position when we look with the tuned lens.
You can see how for positions 1 & 2 the output distribution is decided at layer 20: because we overwrote the residual stream with a huge coefficient, all the LayerNorm’d outputs we add afterwards are tiny in comparison, so in the final LayerNorm we get ln(bigcoeff*diff + small) ~= ln(bigcoeff*diff) ~= ln(diff) (LayerNorm is scale-invariant, so the huge coefficient drops out).
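Numerically (again with arbitrary stand-in magnitudes; `ln_f` plays the role of the final LayerNorm and `diff` the injected direction):

```python
import torch

ln_f = torch.nn.LayerNorm(768)        # stand-in for GPT-2's final LayerNorm
diff = torch.randn(768)               # stand-in for the injected steering direction
small = torch.randn(768) * 10         # accumulated block outputs after the injection
bigcoeff = 1000.0

# ln(bigcoeff*diff + small) ~= ln(bigcoeff*diff) ~= ln(diff)
lhs = ln_f(bigcoeff * diff + small)
rhs = ln_f(diff)
print(torch.nn.functional.cosine_similarity(lhs, rhs, dim=0).item())  # ≈ 1.0
```

Since the unembedding only ever sees this final LayerNorm output, the logits at those positions are set almost entirely by diff, which is why the distribution looks frozen from layer 20 onwards.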
Thanks for writing this up, I hadn’t realized this. One conclusion I’m drawing is: If the values in the modified residual streams aren’t important to other computations in later sequence positions, then a large-coefficient addition will still lead to reasonable completions.
Yeah, assuming by “not important” you mean “not relevant” (low attention score).