This post was very helpful to me, thank you. But probably not for the reasons you intended (it made me more technically ambitious in my quest to solve podcast diarization). That said, I do have some questions.
1) Are there any “glitch phonemes” analogous to glitch tokens e.g. SolidGoldMagikarp?
2) I don’t understand this plot. Is it saying that the model’s attention is sometimes highly nonlocalized? Part of my confusion is that I don’t know what “Source” and “Destination” mean in this case.
3) How does the model handle multi-speaker audio? 3a) Are there features which shift iff the speaker changes? Text which is highly likely given that feature (e.g. “-” to represent that someone is being cut off)? 3b) What happens if multiple people are talking at once?
4) What happens when the model is listening to e.g. nature sounds, music, laughter etc.?
5) Unrelated, but how hard would it be to stitch together Whisper and e.g. Llama to make a little multi-modal model?
I haven’t found any yet, but that certainly doesn’t mean there aren’t any!
As per my reply to Neel’s comment below, yes: most heads (~4/6 per layer) are highly localized, and you can mask the attention window with no degradation in performance. A few heads per layer are responsible for all the information mixing between sequence positions. Re Source vs Destination: as in language model interp, the destination is the ‘current’ sequence position and the source is the position it is attending to.
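To illustrate what masking the attention window means here, a minimal sketch that restricts each destination position to attend only to nearby source positions. The window size and the exact masking scheme are my own assumptions for illustration, not values from the post:

```python
import torch

def local_attention_mask(seq_len: int, window: int = 32) -> torch.Tensor:
    """Boolean mask that is True where attention is allowed.

    Each destination position i may only attend to source positions j
    with |i - j| <= window (the 'localized' behaviour described above).
    The window size is an arbitrary illustrative choice.
    """
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= window  # (dest, source)

# Example: apply the mask to attention scores before the softmax.
scores = torch.randn(1, 8, 128, 128)           # (batch, heads, dest, source)
mask = local_attention_mask(128, window=32)
scores = scores.masked_fill(~mask, float("-inf"))
attn = scores.softmax(dim=-1)
```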
3a) I didn’t look into this. I think Whisper does speaker diarization, but quite badly, so I would imagine so.
3b) Either it hallucinates or it just transcribes one speaker.
Either no transcript or hallucinations (e.g. it makes up totally unrelated text).
What would be the purpose of this? If you mean stitching together the Whisper encoder with Llama as the decoder and then fine-tuning the decoder for a specific task, this would be very easy (assuming you have enough compute and data).
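For what it’s worth, a rough sketch of that wiring with Hugging Face transformers: encode audio with Whisper, project into the decoder’s embedding space, and prepend it to the text embeddings. The model names, the linear projector, and the prefix scheme are assumptions for illustration; in practice you would train the projector (and possibly fine-tune the decoder) on paired audio-text data:

```python
import torch
import torch.nn as nn
from transformers import WhisperModel, AutoModelForCausalLM, AutoTokenizer

# Model choices are illustrative assumptions, not recommendations.
whisper = WhisperModel.from_pretrained("openai/whisper-small")
llama = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Trainable bridge from Whisper's encoder width to Llama's embedding width.
proj = nn.Linear(whisper.config.d_model, llama.config.hidden_size)

def audio_to_prefix(input_features: torch.Tensor) -> torch.Tensor:
    """Encode log-mel features and project them into Llama's embedding space."""
    enc = whisper.encoder(input_features).last_hidden_state  # (B, T, d_model)
    return proj(enc)                                          # (B, T, hidden)

def forward(input_features: torch.Tensor, prompt: str):
    """Prepend the audio prefix to the prompt embeddings and run the decoder."""
    audio_embeds = audio_to_prefix(input_features)
    text_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = llama.get_input_embeddings()(text_ids)
    inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
    return llama(inputs_embeds=inputs_embeds)
```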