[Linkpost] Interpreting Multimodal Video Transformers Using Brain Recordings
This is a linkpost for https://openreview.net/forum?id=p-vL3rmYoqh.
TL;DR: We show that fine-tuning on a vision-language task does not improve alignment with brain regions thought to support the integration of multimodal information, relative to the pre-trained model.
Abstract: Integrating information from multiple modalities is arguably one of the essential prerequisites for grounding artificial intelligence systems with an understanding of the real world. Recent advances in video transformers that jointly learn from vision, text, and sound over time have made some progress toward this goal, but the degree to which these models integrate information from the input modalities remains unclear. In this work, we present a promising approach for probing a multimodal video transformer model by leveraging neuroscientific evidence of multimodal information processing in the brain. We use the brain recordings of subjects watching a popular TV show to interpret the integration of multiple modalities in a video transformer, before and after it is trained to perform a question-answering task that requires both vision and language information. For the early and middle layers, we show that fine-tuning on the vision-language task does not improve alignment with brain regions thought to support the integration of multimodal information, relative to their pre-trained counterparts. We further show that the top layers of the fine-tuned model align substantially less with the brain representations, yet yield better task performance than other layers, which indicates that the task may require information beyond what is available in the brain recordings.
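For readers unfamiliar with brain-alignment analyses, below is a minimal sketch of the kind of layer-wise encoding-model comparison the abstract describes: regress a layer's activations onto fMRI responses and score held-out prediction accuracy, then compare scores for pre-trained versus fine-tuned representations. This is an illustrative reconstruction under common assumptions (ridge regression, voxel-wise Pearson correlation, synthetic data), not the paper's exact pipeline; all variable names and the cross-validation scheme here are hypothetical.

```python
# Sketch: layer-wise brain alignment via a ridge-regression encoding model.
# Assumes activations and fMRI responses are already aligned in time (one row per TR).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def alignment_score(layer_activations, fmri_responses, n_splits=5):
    """Mean held-out voxel-wise correlation between predicted and true fMRI responses."""
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=False)  # keep temporal blocks contiguous
    for train_idx, test_idx in kf.split(layer_activations):
        model = RidgeCV(alphas=np.logspace(-1, 4, 10))
        model.fit(layer_activations[train_idx], fmri_responses[train_idx])
        pred = model.predict(layer_activations[test_idx])
        true = fmri_responses[test_idx]
        # Pearson r per voxel, averaged across voxels
        r = [np.corrcoef(pred[:, v], true[:, v])[0, 1] for v in range(true.shape[1])]
        scores.append(np.nanmean(r))
    return float(np.mean(scores))

# Toy comparison with synthetic data standing in for pre-trained vs. fine-tuned features.
rng = np.random.default_rng(0)
n_trs, n_features, n_voxels = 400, 64, 100
acts_pretrained = rng.standard_normal((n_trs, n_features))
acts_finetuned = acts_pretrained + 0.5 * rng.standard_normal((n_trs, n_features))
fmri = acts_pretrained @ rng.standard_normal((n_features, n_voxels)) \
       + rng.standard_normal((n_trs, n_voxels))

print("pre-trained alignment:", alignment_score(acts_pretrained, fmri))
print("fine-tuned alignment: ", alignment_score(acts_finetuned, fmri))
```

Running this per layer and per region of interest (e.g., regions implicated in multimodal integration) gives the kind of before/after-fine-tuning comparison summarized in the abstract.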