I wouldn’t expect an LLM to do this. An LLM wants to predict the most likely next word, so it is going to assign high probabilities to semantically similar words (which is why they are clustered in embedding space). Whisper is trying to do speech-to-text, so as well as needing to know about the semantic similarity of words, it also needs to know about words that sound alike. E.g. if it thinks it heard ‘rug’, it is pretty likely that the person speaking actually said ‘mug’, hence these words are clustered. Does that make sense?
EllenaR
Not found any yet but that certainly doesn’t mean there aren’t any!
As per my reply to Neel’s comment below, yes: most heads (~4/6 per layer) are highly localized, and you can mask the attention window with no degradation in performance. A few per layer are responsible for all the information mixing between sequence positions. Re Source vs Destination: as in language model interp, destination is the ‘current’ sequence position and source is the position it is attending to. 3a) Didn’t look into this. I think Whisper does speaker diarization, but quite badly, so I would imagine so. b) Either it hallucinates or it just transcribes one speaker.
Either no transcript or hallucinations (e.g. it makes up totally unrelated text)
What would be the purpose of this? If you mean stitching together the Whisper encoder with LLaMA as the decoder and then fine-tuning the decoder for a specific task, this would be very easy (assuming you have enough compute and data)
Re other layers in the encoder: There are only 4 layers in Whisper tiny. I couldn’t find any ‘listenable’ features in the earlier layers (0 and 1), so I’m guessing they activate more on frequency patterns than on human-recognisable sounds. Simple linear probes trained on layers 2 and 3 suggest they learn language features (e.g. is_french) and is_speech. Haven’t looked into it any more than that though.
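For anyone unfamiliar with linear probes, here is a minimal sketch of the idea: fit a logistic regression on per-example activation vectors and check whether a binary label such as is_french is linearly decodable. This is not the code I used. The function names are made up for illustration, and `make_toy_activations` produces synthetic stand-in data rather than real Whisper activations.

```python
import math
import random


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression (weights w, bias b) by batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(label == 1)
            err = p - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples whose label matches the sign of the probe logit."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == bool(y))
    return correct / len(labels)


def make_toy_activations(n, dim, seed=0):
    """Hypothetical stand-in for real activations: the label shifts dimension 0."""
    rng = random.Random(seed)
    features, labels = [], []
    for i in range(n):
        y = i % 2
        x = [rng.gauss(0.0, 0.5) for _ in range(dim)]
        x[0] += 1.0 if y else -1.0
        features.append(x)
        labels.append(y)
    return features, labels


features, labels = make_toy_activations(n=200, dim=4)
w, b = train_linear_probe(features, labels)
# Accuracy on the training data only; a real probe should use a held-out split.
print(f"probe accuracy on toy data: {probe_accuracy(w, b, features, labels):.2f}")
```

With real data, `features` would be activations taken from a given encoder layer (pooled over time in some way, which is itself a choice) and `labels` would come from dataset metadata.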
Re localisation of attention - ‘I’d predict that most but not all encoder heads are highly localised’ - this looks true when you look at the attn patterns per head. As you said, most heads (4/6) in each layer are highly localised, and you can mask them up to k=10. But there are 1 or 2 heads in each layer that are not so localised and are responsible for the degradation seen when you mask them.
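To make the masking concrete, here is a rough sketch of what restricting a head to a local window means. It assumes a symmetric window (each destination position can only attend to sources within k positions on either side) and is written in plain Python for readability. The function names are illustrative and this is not the actual experiment code.

```python
import math


def local_window_mask(seq_len, k):
    """mask[dest][src] is True when the source is within k positions of the destination."""
    return [[abs(dest - src) <= k for src in range(seq_len)]
            for dest in range(seq_len)]


def masked_softmax(scores, mask):
    """Row-wise softmax over source positions; masked-out sources get zero weight."""
    out = []
    for row, keep in zip(scores, mask):
        # The diagonal is always kept, so there is at least one unmasked score.
        top = max(s for s, m in zip(row, keep) if m)
        exps = [math.exp(s - top) if m else 0.0 for s, m in zip(row, keep)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# Toy example: 5 positions, uniform scores, window of k=1.
mask = local_window_mask(seq_len=5, k=1)
attn = masked_softmax([[0.0] * 5 for _ in range(5)], mask)
for dest, row in enumerate(attn):
    print(dest, [round(a, 2) for a in row])
```

Rows are destination positions and columns are source positions. Applying a mask like this only to the localised heads leaves their attention patterns almost unchanged, whereas applying it to the 1 or 2 non-local heads cuts off the long-range mixing they do.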
Working on that one—the code is not in a shareable state yet but I will link a notebook here once it is!