Thanks for this write-up! In case it’s of interest, we have also performed some exploratory interpretability work using the SVD of model weights.
We examine convolutional layers in models trained on a couple of common vision tasks (CIFAR-10, ImageNet). In short, we similarly take the SVD of the weights in a CNN layer, $W_L = U S V^T$, and project the hidden-layer activations $x_l$ onto the $i$th singular vector, $V_{[i,:]} \, x_l$. These singular-direction “neurons” can then be studied with interpretability methods: we use hypergraphs, feature visualizations, and exemplar images. More detail can be found in The SVD of Convolutional Weights: A CNN Interpretability Framework, and you can explore the OpenAI Microscope-inspired demo we created for a VGG-16 trained on ImageNet here (under the “Feature Visualization” page).
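For concreteness, here is a minimal sketch of that projection (our own variable names, not the code from the paper); a generic conv layer stands in for something like VGG-16's features_14:

```python
# Minimal sketch: SVD of a conv layer's weights and projection of its input
# activations onto the singular directions. Variable names are ours, not the paper's.
import torch
import torch.nn.functional as F

conv = torch.nn.Conv2d(256, 256, kernel_size=3, padding=1)  # stand-in conv layer

# Flatten the 4-D kernel [C_out, C_in, k, k] into the 2-D matrix W_L of shape
# [C_out, C_in * k * k] and take its SVD, W_L = U S V^T.
W = conv.weight.detach()
W_mat = W.reshape(W.shape[0], -1)
U, S, Vt = torch.linalg.svd(W_mat, full_matrices=False)

# Unfold the incoming activations x_l into patches so each patch lives in the same
# R^{C_in * k * k} space as the rows of V^T, then project onto the i-th right
# singular vector: V[i, :] x_l gives the i-th "SVD neuron" at each spatial location.
x_l = torch.randn(1, 256, 56, 56)                             # stand-in activations
patches = F.unfold(x_l, kernel_size=conv.kernel_size,
                   padding=conv.padding, stride=conv.stride)  # [1, C_in*k*k, L]
i = 0
svd_neuron_i = Vt[i] @ patches[0]                             # one value per location
print(S[:5], svd_neuron_i.shape)
```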
To briefly highlight a few findings common to our work and this approach:
- We also find that the top singular direction is systematically less interpretable. For the ImageNet VGG-16 model, this direction tended to encode something like a fur/hair texture, which is common across many classes. For example, see the 0th SVD neuron for the VGG-16 layers features_14, features_21, features_24, and features_28 on our demo.
- Following Martin and Mahoney, we find a similar distribution of singular values (a rough sketch of this kind of spectral check appears after this list).
- Qualitatively, the singular directions in the models we examined were at times more interpretable than neurons in the canonical basis.
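The spectral check mentioned above amounts to looking at the empirical spectral density of $\lambda = \sigma^2$ for the flattened conv kernel, in the spirit of Martin and Mahoney. A rough sketch (a randomly initialized kernel stands in for trained weights here, so its spectrum will look Marchenko–Pastur-like rather than heavy-tailed):

```python
# Rough sketch: histogram of the eigenvalues lambda = sigma^2 of a flattened conv
# kernel's weight matrix. Variable names are ours; swap in a trained layer's weights
# to see the heavy-tailed structure discussed by Martin and Mahoney.
import torch
import matplotlib.pyplot as plt

conv = torch.nn.Conv2d(256, 256, kernel_size=3, padding=1)
W_mat = conv.weight.detach().reshape(conv.weight.shape[0], -1)
S = torch.linalg.svdvals(W_mat)        # singular values sigma_i
eigs = (S ** 2).numpy()                # eigenvalues of W W^T

plt.hist(eigs, bins=100, log=True)
plt.xlabel(r"eigenvalue $\lambda = \sigma^2$")
plt.ylabel("count (log scale)")
plt.title("Empirical spectral density of a conv layer's weight matrix")
plt.show()
```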
And a couple of questions we have:
- Should we expect interpretability using the SVD of weight matrices to be more effective for transformers because of the linear residual stream (as opposed to, e.g., ResNets or models without skip connections)?
- There are probably scenarios where the decomposition is less appropriate. For example, how might the usefulness of this approach change when a model layer is less linear?