Sonia Joseph comments on Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems

Sonia Joseph 14 Mar 2024 17:01 UTC
Right now, there’s a lot to exploit with CLIP and ViTs, so that will be the focus for a while. We may expand to Flamingo or other models if there is demand.
Other modalities would be fascinating. I imagine they have their own idiosyncrasies. I would be interested in audio in the future, but not at the expense of first exploiting vision.
Ideally, yes; a unified interp framework for any modality is the north star. I do think this will be a community effort. Interp research in language was built on findings from many different groups and institutions. Vision and other modalities are just not at that stage yet.