Arthur Conmy comments on Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers