Thanks for the post Ellena!
I was wondering whether the finding “words are clustered by vocal and semantic similarity” also holds in traditional LLMs. I don’t remember seeing that, so could it mean that this modularity also makes interpretability easier?
It seems logical: we have more structure in the data, so there is a better way to cluster the text, but I’m curious about your opinion.
I wouldn’t expect an LLM to do this. An LLM wants to predict the most likely next word, so it is going to assign high probabilities to semantically similar words (which is why they are clustered in embedding space). Whisper is trying to do speech-to-text, so as well as needing to know about the semantic similarity of words, it also needs to know about words that sound similar. E.g. if it thinks it heard ‘rug’, it is quite possible that the person speaking actually said ‘mug’, hence these words are clustered. Does that make sense?
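To make the ‘rug’/‘mug’ intuition a bit more concrete, here is a rough sketch in Python. It is not how Whisper represents words: it uses spelling edit distance as a crude stand-in for how similar two words sound (an assumption of this sketch), and the candidate list is made up for illustration. The point is only that a "sounds alike" measure groups ‘rug’ with ‘mug’, whereas a purely semantic measure would group it with words like ‘carpet’.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical example words; spelling is only a rough proxy for pronunciation.
query = "rug"
candidates = ["mug", "bug", "carpet", "mat"]

# Rank candidates by how close they are to the query under this proxy.
ranked = sorted(candidates, key=lambda w: edit_distance(query, w))
print(ranked)
```

Under this proxy ‘mug’ and ‘bug’ come out closest to ‘rug’, while ‘carpet’, its nearest neighbour in meaning, comes out furthest away.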