I agree, this seems like exactly the same thing, which is great! In hindsight it’s not surprising that you / other people have already thought about this.
Do you think the ‘tree-ified view’ (to use your name for it) is a good abstraction for thinking about how a model works? Are individual terms in the expansion the right unit of analysis?
Just to make it explicit and check my understanding: the residual decomposition is equivalent to the edge / factorized view of the transformer, in that we can express any term in the residual decomposition as a set of edges that form a path from input to output.
And it follows that the (pre final layernorm) output of a transformer is the sum of all the “paths” from input to output constructed from the factorized DAG.
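As a sanity check on the sum-over-paths claim, here is a minimal NumPy sketch. It assumes something real transformers do not satisfy: every component is a plain linear map (hypothetical random matrices stand in for the attention heads and MLPs), with one component per layer and no layernorm. Under that assumption the residual stream output equals the sum over all input-to-output paths exactly; with nonlinear components the expansion does not split term by term this way.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d_model = 4

# Hypothetical linear stand-ins for components, listed in layer order
# (think Attn 1.0, MLP 2, Attn 4.3). Real components are nonlinear.
components = [0.1 * rng.normal(size=(d_model, d_model)) for _ in range(3)]
x = rng.normal(size=d_model)


def residual_forward(x, components):
    """Ordinary residual pass: each component adds its output to the stream."""
    for W in components:
        x = x + W @ x
    return x


def sum_over_paths(x, components):
    """Sum one term per path; a path is any subset of components in layer order."""
    n = len(components)
    total = np.zeros_like(x)
    for r in range(n + 1):
        for path in itertools.combinations(range(n), r):
            term = x  # the empty path is the identity term (input -> output)
            for i in path:  # earlier layers are applied first
                term = components[i] @ term
            total = total + term
    return total


paths_match = np.allclose(residual_forward(x, components),
                          sum_over_paths(x, components))
```

With three components there are 2^3 = 8 paths, from the identity term up to the full composition of all three.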
I agree, this seems like exactly the same thing, which is great! In hindsight it’s not surprising that you / other people have already thought about this.
Do you think the ‘tree-ified view’ (to use your name for it) is a good abstraction for thinking about how a model works? Are individual terms in the expansion the right unit of analysis?
The treeified view is different from the factorized view! See figure 1 here.
I think the factorized view is pretty useful. But on the other hand, I think MLP + Attention Head circuits are too coarse-grained to be that interpretable.
Just to make it explicit and check my understanding: the residual decomposition is equivalent to the edge / factorized view of the transformer, in that we can express any term in the residual decomposition as a set of edges that form a path from input to output, e.g.:
Id = input → output
(Attn34∘MLP2∘Attn01) = input → Attn 1.0 → MLP 2 → Attn 4.3 → output
And it follows that the (pre final layernorm) output of a transformer is the sum of all the “paths” from input to output constructed from the factorized DAG.
Actually I think the residual decomposition is incorrect—see my other comment.