Since the same transformer architecture works on images with basically no modification, I suspect it would do well on audio prediction too. Finding a really broad, representative dataset for speech might be difficult, but I guess audiobooks are a good start. The context window might cause problems: the ~2,000 byte-pair tokens of text in GPT-3's window carry far more content than the equivalent ~4,000 bytes of raw audio would, which amounts to only a fraction of a second of speech (rough numbers below). But I bet it would be able to mimic voices pretty well even with a small window. (edit: Actually probably not, see Gwern’s answer.)
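To make that window comparison concrete, here's a quick back-of-envelope calculation. The sample rate and codec bitrate are just illustrative assumptions about typical speech audio, not anything tied to GPT-3 itself.

```python
# Back-of-envelope: how much speech would fit in a GPT-3-sized context window
# if each token were a single byte of audio? (Illustrative assumptions only.)

CONTEXT_TOKENS = 2048  # GPT-3's context window, in BPE tokens

# Text side: rough rule of thumb for English BPE is ~0.75 words per token.
WORDS_PER_TOKEN = 0.75
text_words = CONTEXT_TOKENS * WORDS_PER_TOKEN

# Audio side: raw 16 kHz, 16-bit mono PCM is 32,000 bytes per second.
PCM_BYTES_PER_SECOND = 16_000 * 2
pcm_seconds = CONTEXT_TOKENS / PCM_BYTES_PER_SECOND

# Even a low-bitrate speech codec (~8 kbit/s = 1,000 bytes/s) doesn't help much.
CODEC_BYTES_PER_SECOND = 1_000
codec_seconds = CONTEXT_TOKENS / CODEC_BYTES_PER_SECOND

print(f"Text:  ~{text_words:.0f} words per window")
print(f"PCM:   ~{pcm_seconds:.3f} seconds of raw audio per window")
print(f"Codec: ~{codec_seconds:.2f} seconds of compressed audio per window")
```

So a 2,048-entry window holds a couple of pages of text, but only tens of milliseconds of raw audio, or a couple of seconds even with aggressive compression.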
If your question is whether the trained GPT-3 model could be modified to work with audio, I suspect not. In principle there are layers of abstraction that a transformer should be able to take advantage of, so that word prediction is mostly uncoupled from audio processing, but there’s not a perfect separation, and we wouldn’t know how to interface them. Maybe you could train a separate transformer model that just transcribes audio into text, and stitch them together that way, but there’s not much reason to think it would be a big improvement over existing speech recognition systems.
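Here's a minimal sketch of what "stitch them together" would mean in practice; both model classes are hypothetical placeholders, not any real API, and the point is just that the text interface discards everything except the words.

```python
# Hypothetical pipeline: a separate speech-to-text transformer feeds its
# transcript into an unmodified text language model. Placeholder classes only.

class SpeechToTextModel:
    """Stand-in for a transformer trained only to transcribe audio."""
    def transcribe(self, audio_bytes: bytes) -> str:
        raise NotImplementedError  # imagine a trained ASR transformer here

class TextLanguageModel:
    """Stand-in for a trained text-only model like GPT-3."""
    def continue_text(self, prompt: str) -> str:
        raise NotImplementedError  # imagine next-token prediction on text

def predict_reply(audio_bytes: bytes,
                  asr: SpeechToTextModel,
                  lm: TextLanguageModel) -> str:
    """Transcribe first, then predict text.

    The weak point is exactly this interface: whatever the audio carried
    beyond the words (voice, prosody, timing) is thrown away here.
    """
    transcript = asr.transcribe(audio_bytes)
    return lm.continue_text(transcript)
```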