Ah yes! I tried doing exactly this to produce a sort of ‘logit lens’ to explain the SAE features. In particular, I tried the following (rough code sketch below the list):
1. Take an SAE feature encoder direction and map it directly into the multimodal space to get an embedding.
2. Pass each of the ImageNet text prompts “A photo of a {label}.” through the CLIP text model to generate a multimodal embedding for each ImageNet class.
3. Calculate the cosine similarities between the SAE embedding and the ImageNet class embeddings, and pass these through a softmax to get a probability distribution over classes.
4. Look at the ImageNet labels with high probability; these should give some explanation of what the SAE feature is representing.
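For concreteness, here is roughly what that looks like in code (a minimal sketch using the Hugging Face transformers CLIP API; the SAE weight file, the feature index, and the abbreviated label list are placeholders, and I apply CLIP's visual projection directly to the encoder direction without the final layer norm):

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder: SAE encoder weights [n_features, d_model] trained on CLIP
# vision activations, loaded from wherever they were saved.
W_enc = torch.load("sae_encoder.pt", map_location=device)  # hypothetical path
feature_dir = W_enc[123]  # encoder direction for one SAE feature, shape [d_model]

with torch.no_grad():
    # 1. Map the feature direction into the shared multimodal space via
    #    CLIP's visual projection (no final layer norm applied here).
    feature_emb = model.visual_projection(feature_dir)

    # 2. Embed the ImageNet class prompts with the CLIP text tower.
    labels = ["tench", "goldfish", "great white shark"]  # ...all 1000 classes
    prompts = [f"A photo of a {label}." for label in labels]
    tokens = tokenizer(prompts, padding=True, return_tensors="pt").to(device)
    class_embs = model.get_text_features(**tokens)  # [n_classes, d_proj]

    # 3. Cosine similarities, then a softmax (scaled by CLIP's logit scale)
    #    to get a probability distribution over ImageNet classes.
    feature_emb = feature_emb / feature_emb.norm()
    class_embs = class_embs / class_embs.norm(dim=-1, keepdim=True)
    sims = class_embs @ feature_emb
    probs = torch.softmax(model.logit_scale.exp() * sims, dim=0)

    # 4. Top classes = candidate explanation of what the feature represents.
    top = probs.topk(min(5, len(labels)))
    for p, i in zip(top.values.tolist(), top.indices.tolist()):
        print(f"{labels[i]}: {p:.3f}")
```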
Surprisingly, this did not work at all! I only spent a small amount of time on it (less than a day), so I’m planning to try again. If I remember correctly, I also ran the same analysis on the decoder feature vector, and tried shifting by the decoder bias vector as well; neither of these seemed to provide good ImageNet-class explanations of the SAE features. I’ll try this again and let you know how it goes!
Huh, that’s indeed somewhat surprising if the SAE features are capturing the things that matter to CLIP (in that they reduce loss) and only those things, as opposed to “salient directions of variation in the data”. I’m curious exactly what “failing to work” means—here I think the negative result (and the exact details of said result) are arguably more interesting than a positive result would be.