Just to make it explicit and check my understanding: the residual decomposition is equivalent to the edge / factorized view of the transformer, in that we can express any term in the residual decomposition as a set of edges forming a path from input to output, e.g.
Id = input → output
(Attn 4.3 ∘ MLP 2 ∘ Attn 1.0) = input → Attn 1.0 → MLP 2 → Attn 4.3 → output
And it follows that the (pre-final-layernorm) output of a transformer is the sum over all the “paths” from input to output constructed from the factorized DAG.
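To spell out what I mean by “sum over all paths”, here is the expansion for a toy two-layer attention-only transformer (one head per layer, MLPs omitted), treating each layer’s attention output as a single term added to the residual stream; the names Attn_0 and Attn_1 are just my shorthand for the two layers’ attention terms:

$$
\begin{aligned}
\text{output} &= (\mathrm{Id} + \mathrm{Attn}_1)\,(\mathrm{Id} + \mathrm{Attn}_0)\,x \\
&= x + \mathrm{Attn}_0\,x + \mathrm{Attn}_1\,x + (\mathrm{Attn}_1 \circ \mathrm{Attn}_0)\,x,
\end{aligned}
$$

i.e. one term per path: input → output, input → Attn 0 → output, input → Attn 1 → output, and input → Attn 0 → Attn 1 → output.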
Actually I think the residual decomposition is incorrect—see my other comment.