Just to make it explicit and check my understanding: the residual decomposition is equivalent to the edge / factorized view of the transformer, in that we can express any term in the residual decomposition as a set of edges forming a path from input to output, e.g.
Id = input → output
(Attn 4.3 ∘ MLP 2 ∘ Attn 1.0) = input → Attn 1.0 → MLP 2 → Attn 4.3 → output
And it follows that the (pre-final-layernorm) output of a transformer is the sum over all the “paths” from input to output constructed from the factorized DAG.
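To spell out what I mean by “sum over all paths”, here is the expansion for a toy two-layer attention-only transformer (one head per layer, MLPs omitted), treating each layer’s attention output as a single term added to the residual stream; the names Attn_0 and Attn_1 are just my shorthand for the two layers’ attention terms:

$$
\begin{aligned}
\text{output} &= (\mathrm{Id} + \mathrm{Attn}_1)\,(\mathrm{Id} + \mathrm{Attn}_0)\,x \\
&= x + \mathrm{Attn}_0\,x + \mathrm{Attn}_1\,x + (\mathrm{Attn}_1 \circ \mathrm{Attn}_0)\,x,
\end{aligned}
$$

i.e. one term per path: input → output, input → Attn 0 → output, input → Attn 1 → output, and input → Attn 0 → Attn 1 → output.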
Actually I think the residual decomposition is incorrect—see my other comment.