Arthur Conmy comments on Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

Arthur Conmy 29 Apr 2024 21:25 UTC
13 points
8
Awesome work! I notice I am surprised that this just worked given only 1M datapoints (we use 1000x this with LMs, even small ones), without needing any new techniques, and while producing subjectively extremely abstract features (IMO).
It would be nice if the “guess the image” game was presented as a result rather than a fun side thing in this post. AFAICT that’s the only interpretability result that can’t be critiqued as cherry-picked. You should state front and center that the top features for arbitrary images are basically interpretable; it’s a great result!
- hugofry 30 Apr 2024 22:03 UTC
  3 points
  2
  Thanks for the feedback! Yeah, I was also surprised that SAEs seem to work on ViTs pretty much straight out of the box (I didn’t even need to play around with the hyperparameters too much)! As I mentioned in the post, I think it would be really interesting to train on a much larger (more typical) dataset, similar to the dataset the CLIP model was trained on.
  
  I also agree that I probably should have emphasised the “guess the image” game as a result rather than an aside; I’ll bear that in mind for future posts!