StefanHex comments on Interpretability: Integrated Gradients is a decent attribution method

StefanHex 21 May 2024 9:06 UTC
Maybe I’m confused, but isn’t integrated gradients strictly slower than an ablation to a baseline?

For a single interaction, yes: ablation needs 1 forward pass, while integrated gradients needs an integral with n_alpha integration steps, each requiring a backward pass.
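To make the single-interaction comparison concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy function `f`, its hand-written gradient `grad_f`, the zero baseline, and the midpoint rule used for the integral. It is not the implementation from the post, just the standard integrated-gradients formula on a small example.

```python
# Toy stand-in for "the model": f maps a 3-vector to a scalar.
# (Hypothetical example function, chosen so the gradient is easy to write by hand.)
def f(x):
    return x[0] * x[1] + x[2] ** 2


def grad_f(x):
    # Analytic gradient of f; stands in for one backward pass.
    return [x[1], x[0], 2.0 * x[2]]


def ablation_effect(x, baseline, i):
    # Ablate coordinate i to its baseline value: one extra forward pass.
    ablated = list(x)
    ablated[i] = baseline[i]
    return f(x) - f(ablated)


def integrated_gradients(x, baseline, n_alpha=50):
    # Midpoint Riemann sum over the straight path baseline -> x.
    # Costs n_alpha gradient evaluations (i.e. n_alpha backward passes).
    attributions = [0.0] * len(x)
    for k in range(n_alpha):
        alpha = (k + 0.5) / n_alpha
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(len(x)):
            attributions[i] += g[i] * (x[i] - baseline[i]) / n_alpha
    return attributions
```

For one coordinate, `ablation_effect` calls `f` once on the ablated input, while `integrated_gradients` calls `grad_f` n_alpha times; that is the sense in which IG is slower for a single interaction. The attributions from `integrated_gradients` sum to approximately `f(x) - f(baseline)`, which is a handy sanity check.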

For many interactions (e.g. all connections between two layers), integrated gradients can be faster, roughly whenever n_alpha is smaller than d_embed:
- Ablation requires d_embed^2 forward passes (if you want the effect of every patch on the loss).
- Integrated gradients requires d_embed * n_alpha forward & backward passes.
(This assumes you do path patching rather than “edge patching”, which you should in this scenario.)
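The pass counts above can be compared with a few lines of Python. This is a rough sketch, not a benchmark: the function names are made up for illustration, and `fwd_bwd_cost` (how many forward-pass equivalents one forward & backward pass costs) is an assumption of mine, not a number from the comment.

```python
def ablation_passes(d_embed):
    # One forward pass per (source, target) pair between the two layers.
    return d_embed ** 2


def ig_passes(d_embed, n_alpha):
    # n_alpha forward & backward passes per target dimension.
    return d_embed * n_alpha


def ig_is_cheaper(d_embed, n_alpha, fwd_bwd_cost=3.0):
    # fwd_bwd_cost is a hypothetical weighting: the cost of one
    # forward & backward pass measured in forward passes.
    return ig_passes(d_embed, n_alpha) * fwd_bwd_cost < ablation_passes(d_embed)


if __name__ == "__main__":
    for d_embed, n_alpha in [(64, 50), (512, 50), (4096, 50)]:
        print(d_embed, n_alpha,
              ablation_passes(d_embed),
              ig_passes(d_embed, n_alpha),
              ig_is_cheaper(d_embed, n_alpha))
```

Under this weighting the break-even point is d_embed = n_alpha * fwd_bwd_cost, so the advantage only shows up when the layer is wide relative to the number of integration steps.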

Sam Marks makes a similar point in Sparse Feature Circuits, near equations (2), (3), and (4).