I believe the horizon may be large because, even if the approximation is fairly good at any particular layer, the errors compound as you go through the layers. If we just apply the horizon at the final output the horizon is smaller.
However, if we apply at just the middle layer (6), the horizon is surprisingly small, so we would expect relatively little error propagated.
But this appears to be an outlier. Compare to 5 and 7.
I realized the previous experiment might be importantly misleading because it’s on a small 12 layer model. In larger models it would still be a big deal if the effective layer horizon was like 20 layers.
Previously the code was too slow to run on larger models. But I made an faster version and ran the same experiment on GPT-2 large (48 layers):
We clearly see the same pattern again. As TurnTrout predicted, there seems be something like an exponential decay in the importance of previous layers as you go futher back. I expect that on large models an the effective layer horizon is an importnat consideration.
Computing the exact layer-truncated residual streams on GPT-2 Small, it seems that the effective layer horizon is quite large:
I’m mean ablating every edge with a source node more than n layers back and calculating the loss on 100 samples from The Pile.
Source code: https://gist.github.com/UFO-101/7b5e27291424029d092d8798ee1a1161
I believe the horizon may be large because, even if the approximation is fairly good at any particular layer, the errors compound as you go through the layers. If we just apply the horizon at the final output the horizon is smaller.
However, if we apply at just the middle layer (6), the horizon is surprisingly small, so we would expect relatively little error propagated.
But this appears to be an outlier. Compare to 5 and 7.
Source: https://gist.github.com/UFO-101/5ba35d88428beb1dab0a254dec07c33b
I realized the previous experiment might be importantly misleading because it’s on a small 12 layer model. In larger models it would still be a big deal if the effective layer horizon was like 20 layers.
Previously the code was too slow to run on larger models. But I made an faster version and ran the same experiment on GPT-2 large (48 layers):
We clearly see the same pattern again. As TurnTrout predicted, there seems be something like an exponential decay in the importance of previous layers as you go futher back. I expect that on large models an the effective layer horizon is an importnat consideration.
Updated source: https://gist.github.com/UFO-101/41b7ff0b250babe69bf16071e76658a6