If I understand correctly, you’re saying that my expansion is wrong, because MLP∘(Att+Id) ≠ MLP∘Att + MLP∘Id, which I agree with.
Then isn’t it also true that Att∘(MLP+Id) ≠ Att∘MLP + Att∘Id?
Also, if the output is not a sum of all separate paths, then what’s the point of the unraveled view?
Yes, MLP∘(Att+Id) ≠ MLP∘Att + MLP∘Id is what I’m saying.
Yes, I agree that Att∘(MLP+Id) ≠ Att∘MLP + Att∘Id.
(First, note that the unraveled view can be true without being useful.) In the “Residual Networks Behave Like Ensembles of Relatively Shallow Networks” paper, they find that long paths are mostly not needed by the model. In Causal Scrubbing, they intervene on the treeified view to understand which paths are causally relevant for particular behaviors.
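To make that concrete, here’s a minimal numpy sketch (the weights, shapes, and the names `att`/`mlp`/`treeified` are all hypothetical toy stand-ins, not anything from either paper). It first checks that a ReLU MLP doesn’t distribute over the sum flowing into it, then mimics a treeified-view intervention: the MLP’s input is split into its two incoming branches so one branch can be zero-ablated on its own, even though the output is not a sum over paths.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy residual-stream width (hypothetical)

# Toy stand-ins for one block: a linear "Att" and a one-hidden-layer ReLU "MLP".
W_att = rng.normal(size=(d, d))
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
att = lambda x: W_att @ x
mlp = lambda x: W2 @ np.maximum(0.0, W1 @ x)  # nonlinear because of the ReLU

x = rng.normal(size=d)

# MLP∘(Att+Id) vs. MLP∘Att + MLP∘Id: unequal because mlp is nonlinear.
lhs = mlp(att(x) + x)
rhs = mlp(att(x)) + mlp(x)
print(np.allclose(lhs, rhs))  # almost surely False for random weights

# Treeified view: the MLP's input is split into its two incoming branches,
# each treated as a separate copy, so they can be intervened on independently.
def treeified(att_branch, id_branch):
    return mlp(att_branch + id_branch)

assert np.allclose(lhs, treeified(att(x), x))  # no intervention: same output
scrubbed = treeified(att(x), np.zeros(d))      # zero-ablate the Id→MLP branch
```

So the point of the unraveled/treeified view, as I understand it, isn’t that the output decomposes into a sum of path outputs; it’s that each path becomes a separate site you can intervene on.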
That makes sense to me. I guess I’m dissatisfied here because the idea of an ensemble seems to be that the individual components are independent, whereas in the unraveled view of a residual network different paths still interact with each other (e.g., if two paths overlap, then ablating one of them could, in principle, change the value computed by the other path). This seems to be the mechanism that explains redundancy.
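That interaction is easy to exhibit in the same kind of toy setup (again, hypothetical weights and names): the marginal contribution of one path through a shared nonlinear MLP changes when the overlapping path is ablated, so the paths are not independent ensemble members.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4  # toy width (hypothetical)
W_att, W1, W2 = (rng.normal(size=(d, d)) for _ in range(3))
att = lambda x: W_att @ x
mlp = lambda x: W2 @ np.maximum(0.0, W1 @ x)

x = rng.normal(size=d)
zero = np.zeros(d)

# Two overlapping paths through the same MLP: Att→MLP and Id→MLP.
f = lambda att_branch, id_branch: mlp(att_branch + id_branch)

# Marginal effect of the Id→MLP path with the Att→MLP path intact...
effect_with_att = f(att(x), x) - f(att(x), zero)
# ...and with the Att→MLP path zero-ablated.
effect_without_att = f(zero, x) - f(zero, zero)

# Unequal: ablating one path changes what the overlapping path contributes.
print(np.allclose(effect_with_att, effect_without_att))  # almost surely False
```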