Interesting work! Cool to see mech interp done in such different modalities.
Did you look at neurons in other layers in the encoder? I’m curious if there are more semantic or meaningful audio features. I don’t know how many layers the encoder in Whisper tiny has.
Re localisation of attention, did you compute per-head statistics of how far away each head attends? That seems a natural way to get more info on this. I'd predict that most but not all encoder heads are highly localised (just like in language models!). The fact that k=75 starts to mess up performance demonstrates that such non-localised heads must exist, IMO. And it'd be cool to investigate what kinds of attentional features exist: what are the induction heads of audio encoders?!
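Concretely, for the per-head distance statistics I'm imagining something like this rough sketch (using HuggingFace transformers; the `audio` array here is a placeholder, not anything from the post):

```python
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
encoder = WhisperModel.from_pretrained("openai/whisper-tiny").encoder

audio = np.random.randn(16_000 * 5)  # placeholder: 5 s of noise; use a real clip
inputs = feature_extractor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    out = encoder(inputs.input_features, output_attentions=True)

# out.attentions: one (batch, n_heads, seq, seq) tensor per layer.
for layer, attn in enumerate(out.attentions):
    pos = torch.arange(attn.shape[-1])
    dist = (pos[:, None] - pos[None, :]).abs().float()  # |query - key|
    # Attention-weighted distance, averaged over batch and query positions.
    mean_dist = (attn * dist).sum(-1).mean(dim=(0, 2))  # -> (n_heads,)
    print(f"layer {layer}: {[round(d, 1) for d in mean_dist.tolist()]}")
```

A highly localised head should have a small mean distance here; a long-range head should stick out immediately.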
Re other layers in the encoder: there are only 4 layers in Whisper tiny. I couldn't find any ‘listenable’ features in the earlier layers (0 and 1), so I'm guessing they activate more on frequency patterns than on human-recognisable sounds. Simple linear probes trained on layers 2 and 3 suggest they learn language features (e.g. is_french) and is_speech. I haven't looked into it any more than that though.
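For anyone who wants to try this, a simple version of such a probe looks roughly like the sketch below (the mean-pooling and logistic-regression choices are illustrative, not necessarily exactly what I ran, and `clips`/`labels` are placeholders for your dataset):

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
encoder = WhisperModel.from_pretrained("openai/whisper-tiny").encoder

def pooled_activations(audio, layer: int) -> torch.Tensor:
    """Mean-pool the residual stream after `layer` over time -> (d_model,)."""
    feats = feature_extractor(audio, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        out = encoder(feats.input_features, output_hidden_states=True)
    # hidden_states[0] is the embedding output; [layer + 1] follows block `layer`.
    return out.hidden_states[layer + 1][0].mean(dim=0)

# `clips` (16 kHz arrays) and `labels` (e.g. is_french flags) are placeholders.
X = torch.stack([pooled_activations(a, layer=2) for a in clips]).numpy()
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out probe accuracy:", probe.score(X_te, y_te))
```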
Re localisation of attention, on ‘I'd predict that most but not all encoder heads are highly localised’: this looks true when you look at the attention patterns per head. As you said, most heads (4/6) in each layer are highly localised; you can mask them down to k=10 without hurting performance. But there are 1 or 2 heads in each layer that are not so localised, and these are responsible for the degradation seen when you mask them.
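In case it's useful, the masking is conceptually just a band-diagonal constraint on the chosen heads' pre-softmax scores. A minimal sketch (the function names are mine for illustration; in practice you'd patch this into the attention forward pass, e.g. via a hook):

```python
import torch

def local_attention_mask(seq_len: int, k: int) -> torch.Tensor:
    """True where |query - key| <= k, i.e. where attention is allowed."""
    pos = torch.arange(seq_len)
    return (pos[:, None] - pos[None, :]).abs() <= k

def localise_heads(scores: torch.Tensor, heads: list, k: int) -> torch.Tensor:
    """Restrict the given heads' pre-softmax scores to a +/-k band.

    scores: (batch, n_heads, seq, seq) attention logits.
    """
    allowed = local_attention_mask(scores.shape[-1], k).to(scores.device)
    scores[:, heads] = scores[:, heads].masked_fill(~allowed, float("-inf"))
    return scores
```

Applying this to the 4 localised heads per layer with k=10 leaves transcription intact; including the remaining 1-2 heads is what causes the degradation.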