@Oliver Daniels-Koch’s reply to my comment made me read this post again more carefully, and now I think that your formulation of the residual expansion is incorrect.
Given T=(MLP+Id)∘(Att+Id), it does not follow that T=(MLP∘Att)+(MLP)+(Att)+Id, because MLP is a non-linear operation. T cannot be decomposed like this.
My understanding of your big summation (with C representing any MLP or attention head):
$$T(x) = x + \sum_{i=0}^{n-1} C_i(x) + \sum_{i=1}^{n-1}\sum_{j=0}^{i-1} C_i \circ C_j(x) + \sum_{i=2}^{n-1}\sum_{j=1}^{i-1}\sum_{k=0}^{j-1} C_i \circ C_j \circ C_k(x) + \cdots$$
again does not hold, because the $C_i$ are non-linear.
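To make the non-linearity point concrete, here is a minimal numerical sketch for the single-block case T=(MLP+Id)∘(Att+Id). This is my own toy example, not code from the post: "Att" is stood in for by a random linear map and "MLP" by a linear map followed by a ReLU.

```python
import numpy as np

rng = np.random.default_rng(0)
W_att = rng.normal(size=(4, 4))   # toy "Att": a random linear map
W_mlp = rng.normal(size=(4, 4))   # toy "MLP": a linear map followed by a ReLU

def att(x):
    return W_att @ x

def mlp(x):
    return np.maximum(W_mlp @ x, 0.0)  # the ReLU is the non-linearity

x = rng.normal(size=4)

# The actual block: T = (MLP + Id) ∘ (Att + Id)
x_mid = att(x) + x               # residual stream after the attention sub-block
actual = mlp(x_mid) + x_mid      # residual stream after the MLP sub-block

# The proposed expansion: (MLP∘Att) + MLP + Att + Id, each term applied to x
expansion = mlp(att(x)) + mlp(x) + att(x) + x

print(np.allclose(actual, expansion))  # False: MLP(a + b) != MLP(a) + MLP(b)
```

The two sides differ exactly because mlp(att(x) + x) is not mlp(att(x)) + mlp(x) for a non-linear mlp, which is the whole objection.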
There are two similar ideas which do hold, namely (1) the treeified / unraveled view and (2) the factorized view (both of which are illustrated in figure 1 here), but your residual expansion / big summation is not equivalent to either.
The treeified / unraveled view is the most similar: it separates out every path from input to output, but, unlike your expansion, it does not claim that the output is the sum of all the separate paths.
The factorized view follows from the treeified view and is just the observation that any point in the residual stream can be decomposed into the sum of the outputs of all previous components:
$$T(x) = x + \sum_{i=0}^{n-1} \mathrm{Output}(C_i)$$
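For contrast, here is a small self-contained sketch (again my own toy stand-ins, not code from the post) of why the factorized identity does hold: each component's output is evaluated on the residual stream it actually received, and those outputs really are just added together.

```python
import numpy as np

rng = np.random.default_rng(0)
W_att = rng.normal(size=(4, 4))
W_mlp = rng.normal(size=(4, 4))
att = lambda x: W_att @ x                    # toy "Att"
mlp = lambda x: np.maximum(W_mlp @ x, 0.0)   # toy non-linear "MLP"

x = rng.normal(size=4)
x_mid = att(x) + x            # residual stream after the attention sub-block
actual = mlp(x_mid) + x_mid   # T(x)

# Factorized view: T(x) = x + sum of component *outputs*, where each output is
# computed on the residual stream that component actually received.
out_att = att(x)
out_mlp = mlp(x + out_att)
print(np.allclose(actual, x + out_att + out_mlp))  # True
```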
If I understand correctly, you’re saying that my expansion is wrong, because MLP∘(Att+Id)≠MLP∘Att+MLP∘Id, which I agree with.
Then isn’t it also true that Att∘(MLP+Id)≠Att∘MLP+Att∘Id?
Also, if the output is not a sum of all separate paths, then what’s the point of the unraveled view?
Yes, MLP∘(Att+Id)≠MLP∘Att+MLP∘Id is what I’m saying.
Yes, I agree that Att∘(MLP+Id)≠Att∘MLP+Att∘Id.
(Firstly, note that it can be true without being useful.) In the Residual Networks Behave Like Ensembles of Relatively Shallow Networks paper, they find that long paths are mostly not needed by the model. In Causal Scrubbing, they intervene on the treeified view to understand which paths are causally relevant for particular behaviors.
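As an illustration of what such an intervention can look like, here is a toy sketch (my own code, not from either paper) of ablating a single path in the treeified view of the one-block example: the copy of Att's output that feeds the MLP is removed, while the direct Att path to the output is kept. Causal Scrubbing would resample a path rather than zero it, but the shape of the intervention, acting on one copy of a component's output without touching the others, is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
W_att = rng.normal(size=(4, 4))
W_mlp = rng.normal(size=(4, 4))
att = lambda x: W_att @ x
mlp = lambda x: np.maximum(W_mlp @ x, 0.0)

x = rng.normal(size=4)
out_att = att(x)
actual = x + out_att + mlp(x + out_att)       # full output, treeified form

# Ablate only the Att→MLP path: the MLP no longer sees Att's output, while the
# direct Att→output path is left intact.
ablated = x + out_att + mlp(x + 0.0 * out_att)
print(np.linalg.norm(actual - ablated))       # generally non-zero
```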
That makes sense to me. I guess I’m dissatisfied here because the idea of an ensemble seems to be that individual components in the ensemble are independent; whereas in the unraveled view of a residual network, different paths still interact with each other (e.g. if two paths overlap, then ablating one of them could also (in principle) change the value computed by the other path). This seems to be the mechanism that explains redundancy.