If you want to get attributions between all pairs of basis elements/features in two layers, attributions based on the effect of a marginal ablation will cost you d^2 forward passes, where d is the number of features in a layer. Integrated gradients will take O(d) backward passes, and if you’re willing to write custom code that exploits the specific form of the layer transition, it can take less than that.
If you’re averaging over a data set, IG is also amenable to additional cost reduction through stochastic source techniques.
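To make the O(d) claim concrete, here is a minimal NumPy sketch of integrated gradients between two layers. The toy layer, the finite-difference stand-in for a backward pass, and all names (`layer`, `n_alpha`, etc.) are hypothetical illustrations, not anyone's actual implementation; the point is that each integration step yields a full Jacobian row per output, so all d_in × d_out pairs come from d_out backward passes per step rather than one pass per pair.

```python
import numpy as np

def layer(x):
    # Toy "layer transition": a fixed linear map followed by a nonlinearity.
    W = np.array([[1.0, 2.0], [0.5, -1.0]])
    return np.tanh(W @ x)

def jacobian(x, eps=1e-6):
    # Central finite differences stand in for backward passes here; in a real
    # model each row of J would come from one backward pass on one output.
    d = x.size
    J = np.zeros((layer(x).size, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        J[:, i] = (layer(x + e) - layer(x - e)) / (2 * eps)
    return J

def integrated_gradients(x, baseline, n_alpha=64):
    # attr[j, i]: attribution of input feature i to output feature j.
    attr = np.zeros((layer(x).size, x.size))
    for k in range(n_alpha):
        alpha = (k + 0.5) / n_alpha  # midpoint rule along the straight path
        attr += jacobian(baseline + alpha * (x - baseline))
    return attr / n_alpha * (x - baseline)  # broadcasts over columns

x = np.array([1.0, -0.5])
baseline = np.zeros(2)
attr = integrated_gradients(x, baseline)
# Completeness: rows sum to layer(x) - layer(baseline), up to integration error.
print(np.allclose(attr.sum(axis=1), layer(x) - layer(baseline), atol=1e-3))
```

The completeness check at the end is the usual IG sanity test: the attributions along each output row should sum to the total change in that output between the baseline and the input.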
Maybe I’m confused, but isn’t integrated gradients strictly slower than an ablation to a baseline?
For a single interaction, yes (one forward pass vs. an integral approximated with n_alpha integration steps, each requiring a backward pass).
For many interactions (e.g. all connections between two layers), IG can be faster:
Ablation requires d_embed^2 forward passes (if you want to get the effect of every patch on the loss)
Integrated gradients requires d_embed * n_alpha forward & backward passes
(This is assuming you do path patching rather than “edge patching”, which you should in this scenario.)
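A sketch of where the d_embed^2 comes from on the ablation side: each (upstream feature, downstream feature) edge needs its own patched forward pass to get that edge's effect on the loss. The toy two-layer model, the loss, and all names below are hypothetical, chosen only to make the pass-counting explicit.

```python
import numpy as np

# Toy two-layer transition: downstream features z = tanh(W @ a), loss = sum(z^2).
W = np.array([[1.0, 2.0], [0.5, -1.0]])

def loss_with_patch(a, a_base, i=None, j=None):
    # One forward pass; optionally patch the single edge a_i -> z_j so that
    # z_j sees the baseline value of feature i while everything else sees a.
    pre = W @ a
    if i is not None:
        pre[j] += W[j, i] * (a_base[i] - a[i])
    z = np.tanh(pre)
    return float((z ** 2).sum())

def ablation_edge_attributions(a, a_base):
    # attr[j, i]: loss change from patching edge i -> j to its baseline.
    # One forward pass per (i, j) pair, hence d_in * d_out (~ d_embed^2)
    # forward passes for all pairs -- the cost quoted above.
    d_in, d_out = a.size, W.shape[0]
    base_loss = loss_with_patch(a, a_base)
    attr = np.zeros((d_out, d_in))
    for i in range(d_in):
        for j in range(d_out):
            attr[j, i] = loss_with_patch(a, a_base, i, j) - base_loss
    return attr

a = np.array([1.0, -0.5])
attr = ablation_edge_attributions(a, np.zeros(2))
```

Contrast with the IG accounting above: here the loop over pairs is unavoidable because each patch changes the forward computation, whereas gradients deliver a whole row of pairwise effects per backward pass.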
Sam Marks makes a similar point in Sparse Feature Circuits, near equations (2), (3), and (4).