Joseph Miller comments on The Residual Expansion: A Framework for thinking about Transformer Circuits

Joseph Miller 7 Aug 2024 19:19 UTC
3 points
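Yes, $\mathrm{MLP} \circ (\mathrm{Att} + \mathrm{Id}) \neq \mathrm{MLP} \circ \mathrm{Att} + \mathrm{MLP} \circ \mathrm{Id}$ is what I’m saying.
1. Yes, I agree that $\mathrm{Att} \circ (\mathrm{MLP} + \mathrm{Id}) \neq \mathrm{Att} \circ \mathrm{MLP} + \mathrm{Att} \circ \mathrm{Id}$.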
2. (First, note that it can be true without being useful.) In the paper "Residual Networks Behave Like Ensembles of Relatively Shallow Networks", the authors find that long paths are mostly not needed by the model. In Causal Scrubbing, they intervene on the treeified view to understand which paths are causally relevant for particular behaviors.
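To make the non-distributivity concrete, here is a minimal sketch with toy stand-ins. The functions `mlp`, `att`, and `linear_mlp` are hypothetical: `mlp` is just an elementwise ReLU and `att` is a fixed linear map, not real transformer layers. The point is only that a nonlinear outer function does not distribute over a sum, while a linear one does.

```python
def add(u, v):
    """Elementwise sum of two vectors (plain lists)."""
    return [a + b for a, b in zip(u, v)]

def identity(v):
    return list(v)

def att(v):
    # Toy stand-in for attention: a fixed linear map (assumption).
    return [-2.0 * t for t in v]

def mlp(v):
    # Toy stand-in for an MLP: elementwise ReLU, i.e. nonlinear (assumption).
    return [max(0.0, t) for t in v]

def linear_mlp(v):
    # A linear outer function, for contrast.
    return [3.0 * t for t in v]

x = [1.0, -1.0]

# Nonlinear case: MLP∘(Att + Id) versus MLP∘Att + MLP∘Id
composed = mlp(add(att(x), identity(x)))
split = add(mlp(att(x)), mlp(identity(x)))
print(composed, split)  # [0.0, 1.0] [1.0, 2.0]

# Linear case: the two sides agree
lin_composed = linear_mlp(add(att(x), identity(x)))
lin_split = add(linear_mlp(att(x)), linear_mlp(identity(x)))
print(lin_composed == lin_split)  # True
```

With the ReLU in place the two sides differ, so splitting the composition into a sum of per-path terms is not exact once the outer function is nonlinear.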
- Daniel Tan 8 Aug 2024 8:34 UTC
  1 point
  That makes sense to me. I guess I’m dissatisfied here because the idea of an ensemble seems to be that the individual components are independent, whereas in the unraveled view of a residual network, different paths still interact with each other (e.g. if two paths overlap, then ablating one of them could, in principle, also change the value computed by the other). This seems to be the mechanism that explains redundancy.
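The overlap point can be sketched with a toy example. Assume two residual blocks that are scalar linear maps `a` and `b` (hypothetical values chosen only so that the path decomposition is exact). The output then unravels into four paths, and ablating block A changes every path that passes through A, not just one.

```python
def run(x, a, b, ablate_a=False):
    """Two linear residual blocks: y = (1 + b) * (1 + a) * x."""
    if ablate_a:
        a = 0.0
    h = x + a * x      # residual block A
    return h + b * h   # residual block B

def paths(x, a, b, ablate_a=False):
    """Unraveled view: one term per path through the two blocks."""
    if ablate_a:
        a = 0.0
    return {
        "id": x,            # skips both blocks
        "A": a * x,         # through A only
        "B": b * x,         # through B only
        "B.A": b * a * x,   # through A, then B
    }

x, a, b = 1.0, 2.0, 3.0
full = paths(x, a, b)
ablated = paths(x, a, b, ablate_a=True)

# In the linear case the path sum matches the network output exactly.
assert sum(full.values()) == run(x, a, b)
assert sum(ablated.values()) == run(x, a, b, ablate_a=True)

# Paths whose value changes when block A is ablated.
changed = [name for name in full if full[name] != ablated[name]]
print(changed)  # ['A', 'B.A']
```

The paths "A" and "B.A" share block A, so they are not independent in the way members of an ensemble would be.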