Interpretability: Integrated Gradients is a decent attribution method
A short post laying out our reasoning for using integrated gradients as an attribution method. It is intended as a stand-alone post based on our LIB papers [1][2]. This work was produced at Apollo Research.
Context
Understanding circuits in neural networks requires understanding how features interact with other features. There are a lot of features, and their interactions are generally non-linear. A good starting point for understanding the interactions might be to just figure out how strongly each pair of features in adjacent layers of the network interacts. But since the relationships are non-linear, how do we quantify their ‘strength’ in a principled manner that isn’t vulnerable to common and simple counterexamples? In other words, how do we quantify how much of the value of a feature in layer $l+1$ should be attributed to a feature in layer $l$?
This is a well-known sort of problem, originally investigated in cooperative game theory. It later made its way into machine learning, where for a time people were quite interested in attributing neural network outputs to their inputs. More recently it has entered interpretability, in the context of attributing variables in one hidden layer of a neural network to variables in another.
Generally, the way people go about this is to set up a series of ‘common-sense’ axioms that the attribution method should fulfil in order to be self-consistent and act like an attribution is supposed to act. Then they try to show that there is one unique method satisfying these axioms. Except that (a) people disagree about which axioms are ‘common-sense’, and (b) the axioms people agree on most don’t quite single out one method as unique, just a class of methods called path attributions. So no attribution method has really been generally accepted as the canonical ‘winner’ in the ML context yet, though some methods are certainly more popular than others.
Integrated Gradients
Integrated gradients is a computationally efficient attribution method (compared to activation patching / ablations) grounded in a series of axioms. It was originally proposed in the context of economics (Friedman 2004) and later used to attribute neural network outputs to their inputs (Sundararajan et al. 2017). More recently, it has also been used for internal feature attribution (Marks et al. 2024, Redwood Research (unpublished) 2022).
Properties of integrated gradients
Suppose we want to explain to what extent the value of an activation $f^{l_2}_i$ in a layer $l_2$ of a neural network can be ‘attributed to’ the various components of the activations $f^{l_1} = [f^{l_1}_0, \dots, f^{l_1}_d]$ in a layer $l_1$ upstream of $l_2$.[1] For now, we do this for a single datapoint only. So we want to know how much $f^{l_2}_i(x)$ can be attributed to $f^{l_1}_j(x)$. We’ll write this attribution as $A^{l_2,l_1}_{i,j}(x)$.
There is a standard list of four properties attribution methods should satisfy, which singles out path attributions as the only kind of attribution method that can be used to answer this question. Integrated gradients, like other path attribution methods, fulfils all of these (Sundararajan et al. 2017).
1. Implementation Invariance: If two different networks have activations $f^{l_2}$, $g^{l_2}$ such that $f^{l_2}_i(f^{l_1}) = g^{l_2}_i(f^{l_1})$ for all possible inputs $f^{l_1}$, then the attributions for any $f^{l_1}_j$ in both networks are the same.
2. Completeness: The sum over all attributions equals the value of $f^{l_2}_i(x)$, that is, $\sum_j A^{l_2,l_1}_{i,j}(x) = f^{l_2}_i(x)$.
3. Sensitivity: If $f^{l_2}_i$ does not depend (mathematically) on $f^{l_1}_j$, the attribution of $f^{l_1}_j$ for $f^{l_2}_i$ is zero.
4. Linearity: Let $g = a_1 f^{l_2}_{i_1} + a_2 f^{l_2}_{i_2}$. Then the attribution from $f^{l_1}_j$ to $g$ should equal the weighted sum of its attributions to $f^{l_2}_{i_1}$ and $f^{l_2}_{i_2}$.
If you add a fifth requirement, that the attribution method behaves sensibly under coordinate transformations, integrated gradients is the only attribution method that satisfies all five axioms:
5. Consistency under Coordinate Transformations: If we transform layer $l_1$ into an alternate basis of orthonormal coordinates in which the activation vector is one-hot, $f^{l_1}(x) = [\lVert f^{l_1}(x) \rVert, 0, \dots, 0]$,[2] then the first component $f^{l_1}_0(x)$ should receive the full attribution $f^{l_2}_i(x)$, and the other components should receive zero attribution.
In other words, all the attribution should go to the direction our activation vector $f^{l_1}(x)$ actually lies in. If we go into an alternate coordinate basis such that one of our basis vectors $e_1$ lies along $f^{l_1}(x)$, i.e. $e_1 = f^{l_1}(x) / \lVert f^{l_1}(x) \rVert$, then the component along $e_1$ should get all the attribution at data point $x$, because the other components aren’t even active and thus obviously can’t influence anything.
We think that this is a pretty important property for an attribution method to have in the context of interpreting neural network internals. The hidden layers of neural networks don’t come with an obvious privileged basis. Their activations are vectors in a vector space, which we can view in any basis we please. So in a sense, any structure in the network internals that actually matters for the computation should be coordinate independent. If our attribution methods are not well-behaved under coordinate transformations, they can give all kinds of misleading results, for example by taking the network out of the subspace the activations are usually located in.
Property 4 already ensures that the attributions are well-behaved under linear coordinate transformations of the target layer $l_2$. This fifth axiom ensures they are also well-behaved under coordinate transformations in the starting layer $l_1$.
We will show below that adding the 5th requirement singles out integrated gradients as the canonical attribution method that satisfies all five requirements.
Integrated gradient formula
The general integrated gradient formula to attribute the influence of feature $f^{l_1}_j(x)$ in a layer $l_1$ on feature $f^{l_2}_i(x)$ in a layer $l_2$ is given by an integral along a straight-line path $C$ in the layer $l_1$ activation space. To clarify notation, we introduce the function $F^{l_2,l_1}: \mathbb{R}^{d_{l_1}} \to \mathbb{R}^{d_{l_2}}$ which maps activations from layer $l_1$ to layer $l_2$. For example, in an MLP (bias folded in) we might have $F^{l_2,l_1}(f^{l_1}) = \mathrm{ReLU}(W^{l_1} f^{l_1})$. Then we can write the attribution from $f^{l_1}_j(x)$ to $f^{l_2}_i(x)$ as
$$A^{l_2,l_1}_{i,j}(x) := \int_C dz_j \, \frac{\partial}{\partial z_j} F^{l_2,l_1}_i(z) = \left(f^{l_1}_j(x) - b^{l_1}_j\right) \int_0^1 d\alpha \left[\frac{\partial}{\partial z_j} F^{l_2,l_1}_i(z)\right]_{z = \alpha f^{l_1}(x) + (1-\alpha)\, b^{l_1}},$$
where $z$ is a point in the layer $l_1$ activation space, and the path $C$ is parameterised by $\alpha \in [0,1]$, such that along the curve we have $z(\alpha) = \alpha f^{l_1}(x) + (1-\alpha)\, b^{l_1}$.[2] (For baseline $b^{l_1} = 0$, which we use for most of this post, the prefactor is simply $f^{l_1}_j(x)$.)
Intuitively, this formula asks us to integrate the gradient of $f^{l_2}_i$ with respect to the $j$-th coordinate of layer $l_1$ along a straight path from a baseline activation $b^{l_1}$ to the actual activation vector $f^{l_1}(x)$, and to multiply the result by $f^{l_1}_j(x) - b^{l_1}_j$.
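To make this concrete, here is a minimal numerical sketch of the formula (our own toy example, not the LIB implementation): a midpoint Riemann sum over the straight-line path, applied to a random ReLU layer, together with a check of the completeness axiom.

```python
import torch

def integrated_gradients(F, x, baseline, i, n_steps=100):
    """Attribute output component F(.)[i] to each coordinate of the
    layer-l1 activation vector x, via a midpoint Riemann sum over the
    straight-line path from `baseline` to `x`."""
    alphas = (torch.arange(n_steps, dtype=torch.float32) + 0.5) / n_steps
    grad_sum = torch.zeros_like(x)
    for alpha in alphas:
        z = (alpha * x + (1 - alpha) * baseline).detach().requires_grad_(True)
        F(z)[i].backward()          # dF_i/dz_j at the path point z(alpha)
        grad_sum += z.grad
    # prefactor (f_j(x) - b_j); reduces to f_j(x) for baseline zero
    return (x - baseline) * grad_sum / n_steps

# Completeness check (axiom 2): with baseline b, attributions sum to
# F_i(x) - F_i(b), which equals F_i(x) here since F(0) = 0 for this map.
torch.manual_seed(0)
W = torch.randn(4, 3)
F = lambda z: torch.relu(W @ z)     # toy layer-to-layer map (bias folded in)
x, b = torch.randn(3), torch.zeros(3)
attr = integrated_gradients(F, x, b, i=0)
assert torch.allclose(attr.sum(), F(x)[0] - F(b)[0], atol=1e-2)
```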
Proof sketch: Integrated Gradients are uniquely consistent under coordinate transformations
Friedman 2004 showed that any attribution method satisfying the first four axioms must be a path attribution of the form
$$A^{l_2,l_1}_{i,j}(x) := \int_C dz_j \left[\frac{\partial}{\partial z^{l_1}_j} F^{l_2,l_1}_i(z^{l_1})\right] \quad \text{with} \quad z^{l_1}(\alpha): \mathbb{R} \to \mathbb{R}^{d_{l_1}}, \quad z^{l_1}(0) = b^{l_1}, \quad z^{l_1}(1) = f^{l_1}(x),$$
or a convex combination (weighted average with weights $c_k$) of these:
$$A^{l_2,l_1}_{i,j}(x) := \sum_k c_k \int_{C_k} dz_{k,j} \left[\frac{\partial}{\partial z^{l_1}_{k,j}} F^{l_2,l_1}_i(z^{l_1}_k)\right] \quad \text{with} \quad z^{l_1}_k(0) = b^{l_1}, \quad z^{l_1}_k(1) = f^{l_1}(x), \quad \sum_k c_k = 1, \; c_k \geq 0.$$
Each term is a line integral along a monotonic path $C_k$ in the activation space of layer $l_1$ that starts at the baseline $b^{l_1}$ and ends at the activation vector $f^{l_1}(x)$.
Claim: The only attribution that also satisfies the fifth axiom is the straight line from $b^{l_1}$ to $f^{l_1}(x)$. That is, $c_k = 0$ for all paths in the sum except for the path parametrised as
$$z^{l_1}_1(\alpha) = b^{l_1}(1-\alpha) + \alpha f^{l_1}(x).$$
Proof sketch: Take
$$f^{l_2}(f^{l_1}(x)) = b^{l_1} + \sum_k U_{1,k}\left(f^{l_1}_k(x) - b^{l_1}_k\right) e^{-\sum_{i>1} \left(\sum_j U_{i,j}\left(f^{l_1}_j(x) - b^{l_1}_j\right)\right)^2}$$
as the mapping between layers $l_1$ and $l_2$, with $U \in \mathbb{R}^{d_{l_1} \times d_{l_1}}$ an orthogonal matrix, $U U^T = 1$, and $U_{1,k} = \frac{f^{l_1}_k(x) - b^{l_1}_k}{\lVert f^{l_1}(x) - b^{l_1} \rVert}$. Then, for any monotonic path $C_k$ which is not the straight line $z^{l_1}_1(\alpha)$, at least one direction $v$ in layer $l_1$ with $v \cdot f^{l_1}(x) = 0$ will be assigned an attribution $> 0$.
Since no monotonic path leads to a negative attribution, the sum over all paths must then also yield an attribution $> 0$ for those $v$, unless $c_k = 0$ for every path in the sum except $z^{l_1}_1(\alpha) = b^{l_1}(1-\alpha) + \alpha f^{l_1}(x)$.
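As a numerical sanity check of the claim (with our own hypothetical choices: a smooth tanh layer rather than the function from the proof sketch, and one particular bent-but-monotonic path), we can rotate layer $l_1$ so that the first basis vector points along $f^{l_1}(x) - b^{l_1}$ and compare attributions in that basis:

```python
import torch

torch.manual_seed(0)
d = 3
W = torch.randn(4, d)
F = lambda z: torch.tanh(W @ z)          # smooth toy map from layer l1 to l2
x, b = torch.randn(d), torch.zeros(d)

# Orthonormal basis Q whose first column lies along x - b (axiom 5's basis).
M = torch.eye(d)
M[:, 0] = x - b
Q, _ = torch.linalg.qr(M)
if torch.dot(Q[:, 0], x - b) < 0:
    Q = -Q                               # fix sign so Q[:, 0] points along x - b

def attribution_in_basis(path, n_steps=2000):
    """Path attribution of F(.)[0], expressed in rotated coordinates u = Q^T z."""
    attr = torch.zeros(d)
    for k in range(n_steps):
        lo, hi = k / n_steps, (k + 1) / n_steps
        u = (Q.T @ path((lo + hi) / 2)).detach().requires_grad_(True)
        F(Q @ u)[0].backward()           # gradient w.r.t. the rotated coordinates
        attr += u.grad * (Q.T @ (path(hi) - path(lo)))
    return attr

straight = lambda a: b + a * (x - b)     # the straight line z_1(alpha)
# monotonic in every original coordinate, but not straight:
bent = lambda a: b + torch.tensor([a, a**2, a]) * (x - b)

print(attribution_in_basis(straight))    # all attribution on the first coordinate
print(attribution_in_basis(bent))        # orthogonal coordinates pick up attribution
```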
The problem of choosing a baseline
The integrated gradient formula still has one free hyperparameter in it: the baseline $b^{l_1}$. We are trying to attribute the activations in one layer to the activations in another layer, and this requires specifying the coordinate origin relative to which the activations are defined.
Zero might look like a natural choice here, but if we are folding the biases into the activations, do we want the baseline for the bias to be zero as well? Or maybe we want the origin to be the expectation value of the activations $\mathbb{E}(f^l)$ over the training dataset? But then we’d have a bit of a consistency problem with axiom 2 across layers, because the expectation value of a layer $\mathbb{E}(f^{l+1})$ will often not equal its activation at the expectation value $\mathbb{E}(f^l)$ of the previous layer: $\mathbb{E}(f^{l+1}) \neq F^{l+1,l}(\mathbb{E}(f^l))$. So, with this baseline, the attributions to the activations in a layer $l$ would not add up to the activations in layer $l+1$. In fact, for some activation functions, sigmoids for example, $0 \neq F^{l+1,l}(0)$, so baseline zero potentially has this consistency problem as well.
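Both failure modes are easy to see numerically; here is a toy check with random Gaussian activations and a random sigmoid layer (illustrative choices only):

```python
import torch

torch.manual_seed(0)
f_l = torch.randn(1000, 2)             # toy layer-l activations over a dataset
W = torch.randn(2, 2)
F = lambda z: torch.sigmoid(z @ W)     # toy map to layer l+1

print(F(f_l).mean(dim=0))              # E(f^{l+1})
print(F(f_l.mean(dim=0)))              # F^{l+1,l}(E(f^l)) -- not the same
print(F(torch.zeros(2)))               # F^{l+1,l}(0) = [0.5, 0.5] != 0
```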
We don’t feel like we have a good framing for picking the baseline in a principled way yet.
Attributions over datasets
We now have a method for computing attributions on single data points. But when we’re searching for circuits, we’re probably looking for variables that have strong attributions between each other on average, measured over many data points. So how do we average attributions for different data points into a single attribution over a data set in a principled way?
We don’t have a perfect answer to this question. We experimented with applying the integrated gradient definition to functionals, attributing measures of the size of the function $f^{l_2}_i: x \mapsto f^{l_2}_i(x)$ to the functions $f^{l_1}_j: x \mapsto f^{l_1}_j(x)$, but found counterexamples to these (e.g. cancellation between negative and positive attributions). Thus we decided to simply take the RMS over attributions on single datapoints:
$$A^{l_2,l_1}_{i,j}(\mathcal{D}) = \sqrt{\sum_{x \in \mathcal{D}} A^{l_2,l_1}_{i,j}(x)^2}.$$
This averaged attribution does not itself fulfil axiom 2 (completeness), but it seems workable in practice. We have not found any counterexamples (situations where $A^{l_2,l_1}_{i,j}(\mathcal{D}) = 0$ even though $f^{l_1}_j$ is obviously important for $f^{l_2}_i$) for good choices of bases (such as LIB).
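In code, the aggregation is a one-liner on top of the per-datapoint `integrated_gradients` helper sketched earlier (hypothetical code, same caveats):

```python
import torch

def dataset_attribution(F, data, baseline, i, j, n_steps=100):
    """Root of summed squared per-datapoint attributions, A_{i,j}(D)."""
    per_point = torch.stack([
        integrated_gradients(F, x, baseline, i, n_steps)[j] for x in data
    ])
    return per_point.pow(2).sum().sqrt()

# e.g. with the toy F and b from the earlier sketch:
# A = dataset_attribution(F, torch.randn(32, 3), b, i=0, j=1)
```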
Acknowledgements
This work was done as part of the LIB interpretability project [1][2] at Apollo Research, where it benefitted from empirical feedback: the method was implemented by Dan Braun, Nix Goldowsky-Dill, and Stefan Heimersheim. Earlier experiments were conducted by Avery Griffin, Marius Hobbhahn, and Jörn Stöhler.
[1] The activation vectors here are defined relative to some baseline $b$. This can be zero, but it could also be the mean value over some data set.
[2] Integrated gradients still leaves us a free choice of baseline relative to which we measure activations. We chose 0 for most of this post for simplicity, but e.g. the dataset mean of the activations also works.