I predict that instead of LLMs being trained on ASR-generated text, they will be upgraded to be multimodal and trained directly on audio and video in addition to text.
Google has already discussed this publicly, e.g. here: https://blog.google/products/search/introducing-mum/.
This direction makes sense to me, since these models have far greater capacity than ASR models. Why chain multiple systems together when you can learn directly from the data?
I do agree with your underlying point that there are massive amounts of audio and video data that haven't been used much yet, and that they are a great and growing resource for LLM training.