Image interpretability seems mostly so easy because humans are already really good at interpreting 2D images with local structure. But thinking about this does suggest an idea for language model interpretability—how practical is it to find text that a) has high probability according to the prior distribution, b) strongly activates one attention head or feed-forward neuron or something, c) only weakly activates other parts of the transformer (within some reference class)? Probably this has already been tried somewhere and gotten middling results.
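To make a) to c) a bit more concrete, here is a minimal sketch of how the three criteria could be folded into one score. Everything in it is an assumption for illustration: the name `stimulus_score`, the weights `alpha` and `beta`, and the three callables, which stand in for whatever prior and activation measurements one actually has access to.

```python
from typing import Callable, Sequence

Scorer = Callable[[Sequence[str]], float]


def stimulus_score(
    tokens: Sequence[str],
    log_prior: Scorer,                    # a) log-probability of the text under the prior
    target_activation: Scorer,            # b) activation of the unit we care about
    other_activations: Sequence[Scorer],  # c) units in the reference class
    alpha: float = 1.0,                   # hypothetical weight on b)
    beta: float = 1.0,                    # hypothetical weight on c)
) -> float:
    """Higher is better: likely text, strong target activation, weak off-target activation."""
    off_target = 0.0
    if other_activations:
        # Mean absolute activation across the reference class.
        off_target = sum(abs(f(tokens)) for f in other_activations) / len(other_activations)
    return log_prior(tokens) + alpha * target_activation(tokens) - beta * off_target
```

Setting `beta = 0` drops criterion c) entirely, which is one way to check whether it matters in practice.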
On priors, I wouldn’t worry too much about c), since I would expect a ‘super stimulus’ for head A to not be a super stimulus for head B.
I think one of the problems is the discrete input space, i.e. how do you parameterize the sequence that is being optimized?
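One way to sidestep the parameterization question is to not use gradients at all and search directly over tokens. The sketch below is a plain greedy coordinate search, not something proposed in this thread: it tries every vocabulary item at every position and keeps any swap that improves the score. `greedy_token_search`, `score`, and `vocab` are hypothetical names, and the approach only scales to small vocabularies or candidate lists.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_token_search(
    score: Callable[[Sequence[str]], float],
    vocab: Sequence[str],
    init: Sequence[str],
    sweeps: int = 3,
) -> Tuple[List[str], float]:
    """Coordinate-wise hill climbing over a discrete token sequence."""
    tokens = list(init)
    best = score(tokens)
    for _ in range(sweeps):
        improved = False
        for pos in range(len(tokens)):
            for cand in vocab:
                if cand == tokens[pos]:
                    continue
                # Swap a single position and keep the change only if it helps.
                trial = tokens[:pos] + [cand] + tokens[pos + 1:]
                trial_score = score(trial)
                if trial_score > best:
                    tokens, best, improved = trial, trial_score, True
        if not improved:
            break  # local optimum: no single-token swap improves the score
    return tokens, best
```

This only finds a local optimum, and each sweep costs `len(tokens) * len(vocab)` score evaluations, so in practice one would probably restrict `vocab` to a shortlist of candidates per position.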
One idea I just had was trying to fine-tune an LLM with a reward signal given by, for example, the magnitude of the residual delta coming from a particular head (we probably want something else here, maybe net logit change?). The LLM then already encodes a prior over “sensible” sequences and will try to find one of those which activates the head strongly (however we want to operationalize that).
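A rough sketch of what that reward could look like, under some loud assumptions: `head_writes` is a hypothetical list of per-position vectors that the chosen head adds to the residual stream, and the KL-style penalty against the original (reference) LLM is a common trick from RL fine-tuning that I am adding here to keep the generator close to its prior. It is not part of the idea as stated.

```python
import math
from typing import Sequence


def residual_delta_reward(head_writes: Sequence[Sequence[float]]) -> float:
    """Mean L2 norm of one head's per-position contribution to the residual stream."""
    if not head_writes:
        return 0.0
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in head_writes]
    return sum(norms) / len(norms)


def shaped_reward(
    reward: float,
    logp_policy: float,     # log-prob of the sampled text under the fine-tuned LLM
    logp_reference: float,  # log-prob of the same text under the original LLM
    kl_coef: float = 0.1,   # hypothetical penalty strength
) -> float:
    """Reward minus a per-sample KL estimate, so the generator stays near its prior."""
    return reward - kl_coef * (logp_policy - logp_reference)
```

Swapping `residual_delta_reward` for something like net logit change would only change the first function; the shaping step stays the same.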
“Image interpretability seems mostly so easy because humans are already really good […]”
Thank you, this is a good point! I wonder how much of this is humans “doing the hard work” of interpreting the features. It raises the question of whether we will be able to interpret more advanced networks, especially if they evolve features that don’t overlap with the way humans process inputs.
The language model idea sounds cool! I don’t know language models well enough yet but I might come back to this once I get to work on transformers.
Super cool, thanks!