Not found any yet but that certainly doesn’t mean there aren’t any!
As per my reply to Neel's comment below, yes: most heads (~4 of the 6 per layer) are highly localized, and you can mask their attention window with no degradation in performance. A few heads per layer are responsible for all of the information mixing between sequence positions. Re source vs destination: as in language model interpretability, the destination is the 'current' sequence position and the source is the position it is attending to (see the sketch below).
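To make the localization claim concrete, here is a minimal sketch of what restricting a head to a local attention window looks like. The pure-PyTorch attention function and the window size are illustrative assumptions, not Whisper's actual implementation; the point is just that for a 'localized' head, blocking distant source positions barely changes its output.

```python
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: destination position i may only attend to source
    positions j with |i - j| <= window."""
    idx = torch.arange(seq_len)
    dist = (idx[:, None] - idx[None, :]).abs()  # dist[dest, src]
    return dist <= window

def windowed_head(q, k, v, window: int):
    """Scaled dot-product attention for one head with the local mask applied.
    q, k, v: (seq_len, d_head). `window` is a hypothetical choice (a few frames)."""
    d_head = q.shape[-1]
    scores = q @ k.T / d_head ** 0.5                      # (dest, src)
    allowed = local_attention_mask(q.shape[0], window)
    scores = scores.masked_fill(~allowed, float("-inf"))  # block distant sources
    return scores.softmax(dim=-1) @ v                     # rows sum to 1 over sources
```

For the localized heads, the output of `windowed_head` stays close to the unmasked output; the handful of heads doing long-range information mixing are the ones this masking would hurt.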
3a) I didn't look into this. I think Whisper does attempt speaker diarization, but quite badly, so I would imagine so.
b) Either it hallucinates or it just transcribes one speaker.
Either no transcript or hallucinations (e.g. it makes up totally unrelated text).
What would be the purpose of this? If you mean stitching together the Whisper encoder with Llama as the decoder and then fine-tuning the decoder for a specific task, this would be very easy (assuming you have enough compute and data).
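For concreteness, here is a minimal sketch of what that stitching could look like, using Hugging Face's `WhisperModel` encoder and a causal LM as the decoder. The projection layer, the choice of checkpoints, and freezing the encoder while only training the projection (and optionally the decoder) are assumptions for illustration, not something I've run:

```python
import torch
import torch.nn as nn
from transformers import WhisperModel, AutoModelForCausalLM

class SpeechLM(nn.Module):
    """Hypothetical stitch: frozen Whisper encoder -> linear projection -> LLM decoder."""

    def __init__(self, whisper_name="openai/whisper-tiny",
                 llm_name="meta-llama/Llama-2-7b-hf"):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(whisper_name).encoder
        self.decoder = AutoModelForCausalLM.from_pretrained(llm_name)
        # Map Whisper's hidden size into the LLM's embedding space.
        self.proj = nn.Linear(self.encoder.config.d_model,
                              self.decoder.config.hidden_size)
        for p in self.encoder.parameters():  # keep the audio encoder frozen
            p.requires_grad = False

    def forward(self, input_features, text_ids):
        # input_features: batch of log-mel spectrograms; text_ids: target token ids.
        audio = self.proj(self.encoder(input_features).last_hidden_state)
        text = self.decoder.get_input_embeddings()(text_ids)
        inputs_embeds = torch.cat([audio, text], dim=1)
        # Only compute the LM loss on the text tokens, not the audio prefix.
        prefix = torch.full((text_ids.shape[0], audio.shape[1]), -100,
                            dtype=text_ids.dtype, device=text_ids.device)
        labels = torch.cat([prefix, text_ids], dim=1)
        return self.decoder(inputs_embeds=inputs_embeds, labels=labels).loss
```

Fine-tuning would then just be a standard loop over (spectrogram, target text) pairs, backpropagating through the projection and, if you have the compute, the decoder as well.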