Interpretability: Integrated Gradients is a decent attribution method

A short post laying out our reasoning for using integrated gradients as an attribution method. It is intended as a stand-alone post based on our LIB papers [1] [2]. This work was produced at Apollo Research.

Context

Understanding circuits in neural networks requires understanding how features interact with other features. There are a lot of features, and their interactions are generally non-linear. A good starting point for understanding the interactions might be to just figure out how strongly each pair of features in adjacent layers of the network interacts. But since the relationships are non-linear, how do we quantify their ‘strength’ in a principled manner that isn’t vulnerable to common and simple counterexamples? In other words, how do we quantify how much the value of a feature in layer $\ell+1$ should be attributed to a feature in layer $\ell$?

This is a well-known sort of problem, originally investigated in cooperative game theory. It eventually made its way into machine learning, where people were for a while quite interested in attributing neural network outputs to their inputs. Lately it has made its way into interpretability, in the context of attributing variables in one hidden layer of a neural network to variables in another.

Generally, the way people go about this is to set up a series of ‘common-sense’ axioms that the attribution method should fulfil in order to be self-consistent and act like an attribution is supposed to act. Then they try to show that there is one unique method that satisfies these axioms. Except that (a) people disagree about what axioms are ‘common-sense’, and (b) the axioms people agree on most don’t quite single out a single method as unique, just a class of methods called path attributions. So no attribution method has really been generally accepted as the canonical ‘winner’ in the ML context yet, though some methods are certainly more popular than others.

Integrated Gradients

Integrated gradients is a computationally efficient attribution method (compared to activation patching / ablations) grounded in a series of axioms. It was originally proposed in the context of economics (Friedman 2004), and more recently used to attribute neural network outputs to their inputs (Sundararajan et al. 2017). Even more recently, it has started being used for internal feature attribution as well (Marks et al. 2024, Redwood Research (unpublished) 2022).

Properties of integrated gradients

Suppose we want to explain to what extent the value of an activation $f^{\ell+1}_j$ in a layer $\ell+1$ of a neural network can be ‘attributed to’ the various components $f^\ell_i$ of the activations in the layer $\ell$ upstream of $\ell+1$.[1] For now, we do this for a single datapoint only. So we want to know how much $f^{\ell+1}_j$ can be attributed to $f^\ell_i$. We’ll write this attribution as $A_{ij}$.

There is a list of four standard properties that attribution methods should satisfy, and these single out path attributions as the only kind of attribution method that can be used to answer this question. Integrated gradients, like other path attribution methods, fulfils all of these (Sundararajan et al. 2017).

  1. Implementation Invariance: If two different networks have activations $f^{\ell+1}_j(f^\ell)$ and $\tilde f^{\ell+1}_j(f^\ell)$ such that $f^{\ell+1}_j(f^\ell) = \tilde f^{\ell+1}_j(f^\ell)$ for all possible inputs $f^\ell$, then the attributions $A_{ij}$ for any $f^\ell$ in both networks are the same.

  2. Completeness: The sum over all attributions equals the value of $f^{\ell+1}_j$, that is $\sum_i A_{ij} = f^{\ell+1}_j$.

  3. Sensitivity: If $f^{\ell+1}_j$ does not depend (mathematically) on $f^\ell_i$, the attribution of $f^{\ell+1}_j$ to $f^\ell_i$ is zero, $A_{ij} = 0$.

  4. Linearity: Let $f^{\ell+1}_j = a\, g(f^\ell) + b\, h(f^\ell)$ be a weighted sum of two functions $g$ and $h$. Then the attribution from $f^\ell_i$ to $f^{\ell+1}_j$ should equal the weighted sum $a\, A_{ij}[g] + b\, A_{ij}[h]$ of its attributions for $g$ and $h$.

If you add on a fifth requirement that the attribution method behaves sensibly under coordinate transformations, integrated gradients are the only attribution method that satisfies all five axioms:

  5. Consistency under Coordinate Transformations: If we transform layer $\ell$ into an alternate basis of orthonormal coordinates in which the activation vector is one-hot ($f^\ell = (|f^\ell|, 0, \dots, 0)$),[2] then the first component should receive the full attribution $A_{1j} = f^{\ell+1}_j$, and the other components should receive zero attribution.

In other words, all the attribution should go to the direction our activation vector actually lies in. If we go into an alternate basis of coordinates such that one of our coordinate basis vectors $\hat e_1$ lies along $f^\ell$, $\hat e_1 = f^\ell / |f^\ell|$, then the component along $\hat e_1$ should get all the attribution at this data point, because the other components aren’t even active and thus obviously can’t influence anything.
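As a concrete illustration (our own toy numbers, not taken from the papers): suppose the activation at our datapoint is $f^\ell = (3, 4)$ in the standard basis. In the rotated orthonormal basis $\hat e_1 = (0.6, 0.8)$, $\hat e_2 = (-0.8, 0.6)$ the same vector reads

$$f^\ell = 5\, \hat e_1 + 0\, \hat e_2,$$

and the axiom demands that the $\hat e_1$ component receives the full attribution $f^{\ell+1}_j$ while the $\hat e_2$ component receives zero.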

We think that this is a pretty important property for an attribution method to have in the context of interpreting neural network internals. The hidden layers of neural networks don’t come with an obvious privileged basis. Their activations are vectors in a vector space, which we can view in any basis we please. So in a sense, any structure in the network internals that actually matters for the computation should be coordinate independent. If our attribution methods are not well-behaved under coordinate transformations, they can give all kinds of misleading results, for example by taking the network out of the subspace the activations are usually located in.

Property 4 already ensures that the attributions are well-behaved under linear coordinate transformations of the target layer $\ell+1$. This 5th axiom ensures they’re also well-behaved under coordinate transformations in the starting layer $\ell$.

We will show below that adding the 5th requirement singles out integrated gradients as the canonical attribution method that satisfies all five requirements.

Integrated gradient formula

The general integrated gradient formula to attribute the influence of feature $f^\ell_i$ in layer $\ell$ on feature $f^{\ell+1}_j$ in layer $\ell+1$ is given by an integral along a straight-line path in layer $\ell$ activation space. To clarify notation, we introduce a function $F_j$ which maps activations from layer $\ell$ to $f^{\ell+1}_j$, i.e. $f^{\ell+1}_j = F_j(f^\ell)$. For example, in an MLP (bias folded in) we might have $F_j(f^\ell) = \mathrm{ReLU}\big(\sum_i W_{ji} f^\ell_i\big)$. Then we can write the attribution from $f^\ell_i$ to $f^{\ell+1}_j$ as

$$A_{ij} = \left(f^\ell_i - \bar f^\ell_i\right) \int_0^1 \frac{\partial F_j}{\partial f^\ell_i}\big(\gamma(t)\big)\, \mathrm{d}t,$$

where $\gamma(t)$ is a point in the layer $\ell$ activation space, and the path is parameterised by $t \in [0, 1]$, such that along the curve we have $\gamma(t) = \bar f^\ell + t\,(f^\ell - \bar f^\ell)$, with $\bar f^\ell$ the baseline.[2]

Intuitively, this formula asks us to integrate the gradient of $f^{\ell+1}_j$ with respect to $f^\ell_i$ along a straight path from a baseline activation $\bar f^\ell$ to the actual activation vector $f^\ell$, and multiply the result with $f^\ell_i - \bar f^\ell_i$.

We illustrate the integrated gradient attribution with a two-dimensional example. The plot shows a feature $f^{\ell+1}_1$ in layer $\ell+1$ that we want to attribute to the two features $f^\ell_1$ and $f^\ell_2$ in layer $\ell$. The attribution to $f^\ell_1$ (or $f^\ell_2$) is calculated by integrating the gradient of $f^{\ell+1}_1$ with respect to $f^\ell_1$ (or $f^\ell_2$) along a straight line from the baseline activation $\bar f^\ell$, here chosen to be $\bar f^\ell = 0$, to the activation vector $f^\ell$, and multiplying the result by the activation $f^\ell_1$ (or $f^\ell_2$).
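In practice the integral is typically approximated by a Riemann sum over a straight-line path. Below is a minimal sketch of this; the toy map `F`, the midpoint discretisation, and all variable names are our own illustration, not the implementation from the LIB papers:

```python
import torch

def integrated_gradients(F, f, baseline, n_steps: int = 50):
    """Approximate A_i = (f_i - baseline_i) * ∫₀¹ ∂F/∂f_i(γ(t)) dt
    with a Riemann sum along the straight line γ(t) = baseline + t (f - baseline)."""
    alphas = (torch.arange(n_steps, dtype=f.dtype) + 0.5) / n_steps  # midpoints t_k
    grads = []
    for t in alphas:
        point = (baseline + t * (f - baseline)).requires_grad_(True)
        # gradient ∂F/∂f_i evaluated at γ(t); F is scalar-valued here for simplicity
        grads.append(torch.autograd.grad(F(point), point)[0])
    avg_grad = torch.stack(grads).mean(dim=0)   # ≈ ∫₀¹ ∂F/∂f_i dt
    return (f - baseline) * avg_grad            # attribution A_i

# Toy example mirroring the 2D illustration above (our own choice of F):
F = lambda x: torch.relu(x[0] + 2 * x[1])       # maps layer-l activations to one layer-(l+1) feature
f = torch.tensor([1.0, 0.5])                    # activation at the datapoint
baseline = torch.zeros(2)                       # baseline chosen as 0
A = integrated_gradients(F, f, baseline)
print(A, A.sum(), F(f) - F(baseline))           # completeness: sum of A ≈ F(f) - F(baseline)
```

The final print line checks the completeness axiom numerically: the per-feature attributions sum to the change in the downstream feature relative to the baseline.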

Proof sketch: Integrated Gradients are uniquely consistent under coordinate transformations

Friedman 2004 showed that any attribution method satisfying the first four axioms must be a path attribution of the form

$$A^{\gamma}_{ij} = \int_0^1 \frac{\partial F_j}{\partial f^\ell_i}\big(\gamma(t)\big)\, \frac{\mathrm{d}\gamma_i(t)}{\mathrm{d}t}\, \mathrm{d}t,$$

or a convex combination (weighted average with weights $w_\gamma \geq 0$, $\sum_\gamma w_\gamma = 1$) of these,

$$A_{ij} = \sum_\gamma w_\gamma\, A^{\gamma}_{ij}.$$

Each term is a line integral along a monotonous path $\gamma$ in the activation space of layer $\ell$ that starts at the baseline $\bar f^\ell$ and ends at the activation vector $f^\ell$.

Claim: The only attribution that also satisfies the fifth axiom is the straight line from $\bar f^\ell$ to $f^\ell$. That is, $w_\gamma = 0$ for all the paths in the sum except for the path parametrised as $\gamma(t) = \bar f^\ell + t\,(f^\ell - \bar f^\ell)$.

Proof sketch: Take a suitable non-linear function of $O f^\ell$ as the mapping between layers $\ell$ and $\ell+1$, with an orthogonal matrix $O$, and baseline $\bar f^\ell = 0$. Then, for any monotonous path $\gamma$ which is not the straight line $\gamma(t) = t\, f^\ell$, at least one direction $\hat e_i$ in layer $\ell$ with $f^\ell_i = 0$ (in the one-hot basis of axiom 5) will be assigned an attribution $A^\gamma_{ij} \neq 0$.

Since no monotonous path leads to a negative attribution, the sum over all paths must then also yield a non-zero attribution for those directions $\hat e_i$, unless $w_\gamma = 0$ for every path in the sum except the straight line.
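The claim can also be checked numerically. Below is a small sketch with a toy function of our own choosing, $F(f^\ell) = (f^\ell_1)^2$, activation $f^\ell = (1, 1)$ and baseline $0$ (this is not the orthogonal-matrix construction from the proof): the straight-line path assigns zero attribution to the rotated direction the activation does not lie in, while a non-straight monotonous path does not.

```python
import numpy as np

# Toy illustration: F(f) = f_1^2, activation f = (1, 1), baseline 0.
# In the rotated orthonormal basis e1' = (1,1)/√2, e2' = (1,-1)/√2 the activation is
# one-hot: (√2, 0). Axiom 5 demands zero attribution to the inactive direction e2'.
F_grad = lambda f: np.array([2.0 * f[0], 0.0])   # gradient of F(f) = f_1^2
e2 = np.array([1.0, -1.0]) / np.sqrt(2.0)        # inactive rotated direction

def path_attribution_to_e2(path, n=10_000):
    """Line integral ∫ (∇F(γ(t)) · e2') d(γ(t) · e2') along a discretised path."""
    t = np.linspace(0.0, 1.0, n)
    pts = np.array([path(ti) for ti in t])        # points γ(t) on the path
    proj = pts @ e2                               # γ(t) · e2'
    grads = np.array([F_grad(p) @ e2 for p in pts])
    # trapezoid rule: integrand values times increments of the projected coordinate
    return np.sum(0.5 * (grads[1:] + grads[:-1]) * np.diff(proj))

straight = lambda t: np.array([t, t])                              # straight line 0 → (1,1)
elbow = lambda t: np.array([min(2*t, 1.0), max(2*t - 1.0, 0.0)])   # monotone path via (1,0)

print(path_attribution_to_e2(straight))  # ≈ 0: the straight line respects axiom 5
print(path_attribution_to_e2(elbow))     # ≈ -0.5: a non-straight monotone path does not
```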

The problem of choosing a baseline

The integrated gradient formula still has one free hyperparameter in it: the baseline $\bar f^\ell$. We’re trying to attribute the activations in one layer to the activations in another layer. This requires specifying the coordinate origin relative to which the activations are defined.

Zero might look like a natural choice here, but if we are folding the biases into the activations, do we want the baseline for the bias component to be zero as well? Or maybe we want the origin to be the expectation value of the activations over the training dataset? But then we’d have a bit of a consistency problem with axiom 2 across layers, because the expectation value of a layer often will not equal its activation at the expectation value of the previous layer, $\mathbb{E}\left[f^{\ell+1}(f^\ell)\right] \neq f^{\ell+1}\left(\mathbb{E}\left[f^\ell\right]\right)$. So, with this baseline, the attributions to the activations in layer $\ell$ would not add up to the activations in layer $\ell+1$. In fact, for some activation functions, sigmoids for example, $f^{\ell+1}(0) \neq 0$, so baseline zero potentially has this consistency problem as well.
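As a concrete toy example (our own numbers): for a single sigmoid unit $f^{\ell+1} = \sigma(f^\ell)$ and a dataset where $f^\ell$ is $0$ on half the datapoints and $4$ on the other half,

$$\mathbb{E}\left[\sigma(f^\ell)\right] = \tfrac{1}{2}\left(\sigma(0) + \sigma(4)\right) \approx 0.74, \qquad \sigma\left(\mathbb{E}\left[f^\ell\right]\right) = \sigma(2) \approx 0.88,$$

and $\sigma(0) = 0.5 \neq 0$, illustrating both mismatches mentioned above.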

We don’t feel like we have a good framing for picking the baseline in a principled way yet.

Attributions over datasets

We now have a method for how to do attributions on single data points. But when we’re searching for circuits, we’re probably looking for variables that have strong attributions between each other on average, measured over many data points. But how do we average attributions for different data points into a single attribution over a data set in a principled way?

We don’t have a perfect answer to this question. We experimented with applying the integrated gradient definition to functionals, attributing a measure of the size of the function $f^{\ell+1}_j$ (over the dataset) to the functions $f^\ell_i$, but found counter-examples to those (e.g. cancellation between negative and positive attributions). Thus we decided to simply take the RMS over attributions on single datapoints,

$$\hat{A}_{ij} = \sqrt{\mathbb{E}_{x \sim \mathcal{D}}\left[A_{ij}(x)^2\right]}\,.$$

This averaged attribution does not itself fulfil axiom 2 (completeness), but it seems workable in practice. We have not found any counterexamples (situations where $\hat{A}_{ij} \approx 0$ even though $f^\ell_i$ is obviously important for $f^{\ell+1}_j$) for good choices of bases (such as LIB).
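A minimal sketch of this aggregation, reusing the hypothetical `integrated_gradients` helper sketched earlier (the batching and all names are our own illustration, not the LIB implementation):

```python
import torch

def rms_attribution(F, batch, baseline, n_steps: int = 50):
    """RMS over per-datapoint integrated-gradient attributions: sqrt(E_x[A_i(x)^2])."""
    per_point = torch.stack([
        integrated_gradients(F, f, baseline, n_steps)  # A_i(x) for one datapoint x
        for f in batch
    ])
    return per_point.pow(2).mean(dim=0).sqrt()

# Usage with the toy F from the earlier sketch and a small batch of layer-l activations:
F = lambda x: torch.relu(x[0] + 2 * x[1])
batch = torch.tensor([[1.0, 0.5], [0.2, 1.0], [0.0, 0.3]])
print(rms_attribution(F, batch, baseline=torch.zeros(2)))
```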


Acknowledgements

This work was done as part of the LIB interpretability project [1] [2] at Apollo Research where it benefitted from empirical feedback: the method was implemented by Dan Braun, Nix Goldowsky-Dill, and Stefan Heimersheim. Earlier experiments were conducted by Avery Griffin, Marius Hobbhahn, and Jörn Stöhler.

  1. ^

    The activation vectors here are defined relative to some baseline $\bar f$. This can be zero, but it could also be the mean value over some data set.

  2. ^

    Integrated gradients still leaves us a free choice of baseline relative to which we measure activations. We chose 0 for most of this post for simplicity, but e.g. the dataset mean of the activations also works.