If I understand correctly, you’re saying that my expansion is wrong, because MLP∘(Att+Id) ≠ MLP∘Att + MLP∘Id, which I agree with.
Then isn’t it also true that Att∘(MLP+Id) ≠ Att∘MLP + Att∘Id?
Also, if the output is not a sum of all separate paths, then what’s the point of the unraveled view?
Yes, MLP∘(Att+Id) ≠ MLP∘Att + MLP∘Id is what I’m saying.
Yes, I agree that Att∘(MLP+Id) ≠ Att∘MLP + Att∘Id.
(First, note that the unraveled view can be true without being useful.) In the “Residual Networks Behave Like Ensembles of Relatively Shallow Networks” paper, they find that long paths are mostly not needed by the model. In Causal Scrubbing, they intervene on the treeified view to understand which paths are causally relevant for particular behaviors.
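To make that concrete, here’s a minimal numpy sketch (the weights, shapes, and the names `att`/`mlp`/`treeified` are all hypothetical toy stand-ins, not anything from either paper). It first checks that a ReLU MLP doesn’t distribute over the sum flowing into it, then mimics a treeified-view intervention: the MLP’s input is split into its two incoming branches so one branch can be zero-ablated on its own, even though the output is not a sum over paths.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy residual-stream width (hypothetical)

# Toy stand-ins for one block: a linear "Att" and a one-hidden-layer ReLU "MLP".
W_att = rng.normal(size=(d, d))
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
att = lambda x: W_att @ x
mlp = lambda x: W2 @ np.maximum(0.0, W1 @ x)  # nonlinear because of the ReLU

x = rng.normal(size=d)

# MLP∘(Att+Id) vs. MLP∘Att + MLP∘Id: unequal because mlp is nonlinear.
lhs = mlp(att(x) + x)
rhs = mlp(att(x)) + mlp(x)
print(np.allclose(lhs, rhs))  # almost surely False for random weights

# Treeified view: the MLP's input is split into its two incoming branches,
# each treated as a separate copy, so they can be intervened on independently.
def treeified(att_branch, id_branch):
    return mlp(att_branch + id_branch)

assert np.allclose(lhs, treeified(att(x), x))  # no intervention: same output
scrubbed = treeified(att(x), np.zeros(d))      # zero-ablate the Id→MLP branch
```

So the point of the unraveled/treeified view, as I understand it, isn’t that the output decomposes into a sum of path outputs; it’s that each path becomes a separate site you can intervene on.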
That makes sense to me. I guess I’m dissatisfied here because the idea of an ensemble seems to be that the individual components are independent, whereas in the unraveled view of a residual network different paths still interact with each other (e.g., if two paths overlap, then ablating one of them could, in principle, change the value computed by the other path). This seems to be the mechanism that explains redundancy.
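That interaction is easy to exhibit in the same kind of toy setup (again, hypothetical weights and names): the marginal contribution of one path through a shared nonlinear MLP changes when the overlapping path is ablated, so the paths are not independent ensemble members.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4  # toy width (hypothetical)
W_att, W1, W2 = (rng.normal(size=(d, d)) for _ in range(3))
att = lambda x: W_att @ x
mlp = lambda x: W2 @ np.maximum(0.0, W1 @ x)

x = rng.normal(size=d)
zero = np.zeros(d)

# Two overlapping paths through the same MLP: Att→MLP and Id→MLP.
f = lambda att_branch, id_branch: mlp(att_branch + id_branch)

# Marginal effect of the Id→MLP path with the Att→MLP path intact...
effect_with_att = f(att(x), x) - f(att(x), zero)
# ...and with the Att→MLP path zero-ablated.
effect_without_att = f(zero, x) - f(zero, zero)

# Unequal: ablating one path changes what the overlapping path contributes.
print(np.allclose(effect_with_att, effect_without_att))  # almost surely False
```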