The issue with single datapoints, at least in the context we used them for (building interaction graphs for the LIB papers), is that the answer to ‘what directions in the layer were relevant for computing the output?’ is always trivially just ‘the direction the activation vector was pointing in.’
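To make that concrete, here’s a toy numpy sketch (the setup, including the random next-layer weights `W`, is hypothetical, not anything from the LIB papers): for one datapoint, projecting the activation onto its own direction changes nothing downstream, so no attribution method run on that single datapoint can single out anything finer than that one direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=d)          # a single activation vector
W = rng.normal(size=(8, d))     # hypothetical next-layer weights

# Rank-1 projector onto the direction x points in.
P = np.outer(x, x) / (x @ x)

# Projecting x onto its own direction changes nothing downstream:
assert np.allclose(W @ (P @ x), W @ x)
# So for one datapoint, the 'relevant direction' is trivially x itself.
```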
This then leads to every activation vector becoming its own ‘feature’, which is clearly nonsense. To understand generalisation, we need to see how the network is re-using a small common set of directions to compute outputs for many different inputs. Which means looking at a dataset of multiple activations.
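As a minimal sketch of what finding ‘a small common set of directions’ over a dataset can look like, here SVD/PCA is used as a stand-in for the more careful basis-finding the LIB papers actually do; the data and dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 1000, 64, 4   # many datapoints, high-dim layer, few true directions

# Toy activations that secretly reuse k common directions.
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]   # (d, k) orthonormal
acts = rng.normal(size=(n, k)) @ basis.T           # (n, d)
acts += 0.01 * rng.normal(size=(n, d))             # small noise

# SVD over the whole dataset recovers a small shared set of directions;
# run on a single datapoint it would just return that datapoint's direction.
_, s, _ = np.linalg.svd(acts - acts.mean(0), full_matrices=False)
print((s > s[0] * 0.1).sum())   # ~k dominant directions, not n of them
```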
And the trouble a lot of work that attempts to generalise runs into is that some phenomena are very particular to specific cases, so one risks losing a lot of information by focusing only on the generalisable findings.
The application we were interested in here was getting some well-founded measure of how ‘strongly’ two features interact. Not a description of what the interaction is doing computationally, just some way to tell whether it’s ‘strong’ or ‘weak’. We wanted this so we could find modules in the network.
Averaging over data loses us information about what the interaction is doing, but it doesn’t necessarily lose us information about interaction ‘strength’, since that’s a scalar quantity. We just need to set our threshold for connection relevance low enough that an interaction which makes a sizeable difference on even a very small handful of training datapoints still qualifies.
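A toy numerical version of that thresholding point, with made-up attribution numbers standing in for whatever interaction measure you’re actually using:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000                       # dataset size

# Per-datapoint attribution of feature j to feature i (hypothetical numbers):
# zero almost everywhere, but large on a handful of datapoints.
attr = np.zeros(n)
attr[rng.choice(n, size=5, replace=False)] = 10.0

strength = np.abs(attr).mean()   # scalar 'strength' after averaging
print(strength)                  # 5 * 10 / 10_000 = 0.005

# A rare-but-large interaction survives averaging as long as the
# relevance threshold sits below (few datapoints * large effect) / n.
threshold = 1e-3
print(strength > threshold)      # True: the edge is kept
```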