How about audio? Is the speech-to-text domain “close to the metal” enough to deserve focus too, or did people hit roadblocks that made image generators more attractive? If the latter, where can I read about the lessons learned, please?
I know almost nothing about audio ML, but I would expect one big inconvenience when doing audio-NN-interp to be that a lot of complexity in sound is difficult to represent visually. Images and text (/token strings) don’t have this problem.