A better approach, IMO, is to tokenize audio directly and then find a clever way to align text tokens with audio tokens during training, without relying on 100% transcription.
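To make "tokenize audio" a bit more concrete, here is a toy sketch of one way it could look: chop the waveform into fixed-size frames and map each frame to the index of its nearest entry in a codebook (plain vector quantization). Everything here is an assumption for illustration. The function names, the hand-written 3-entry codebook, and the 4-sample frame length are made up, and a real system would presumably learn the codebook (e.g. with a neural codec) instead of hard-coding it. It also only covers the tokenization half; the alignment with text tokens is not attempted.

```python
def frame_signal(samples, frame_len):
    """Split a waveform into non-overlapping frames, dropping any ragged tail."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to the frame (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(frame, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def tokenize_audio(samples, codebook, frame_len):
    """Turn a waveform into a sequence of discrete token ids, one per frame."""
    return [nearest_code(frame, codebook) for frame in frame_signal(samples, frame_len)]


# Hypothetical hand-written codebook over 4-sample frames (illustration only).
codebook = [
    [0.0, 0.0, 0.0, 0.0],    # token 0: near-silence
    [1.0, 1.0, 1.0, 1.0],    # token 1: flat positive level
    [1.0, -1.0, 1.0, -1.0],  # token 2: fast alternation
]

# 13 samples: three full frames plus one leftover sample that gets dropped.
samples = [0.0, 0.1, 0.0, -0.1,
           0.9, 1.1, 1.0, 0.8,
           1.0, -0.9, 1.1, -1.0,
           0.5]

tokens = tokenize_audio(samples, codebook, frame_len=4)
print(tokens)  # [0, 1, 2]
```

The point is just that the model ends up seeing audio as a sequence of integer ids, the same kind of object as text tokens, so both can in principle go through the same training setup.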