I wonder if with the next generations of multimodal models we’ll see a “rubber ducking” phenomenon where, because their self-attention is spread across modalities, things like CoT and using outputs as a scratchpad will perform significantly better in non-text streams.
Will GPT-4o, fed its own auditory outputs, tonal cues and pauses and all, processed as an audio data stream, make connections or leaps it never would if just fed its own text outputs as context?
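Something like the loop below, roughly speaking. To be clear, this is just a sketch of the idea, not code against any real API: `MultimodalClient`, `speak`, and `listen_and_respond` are hypothetical stand-ins for whatever audio-in/audio-out interface a future model actually exposes.

```python
# Sketch of an audio "rubber duck" loop: the model's own spoken draft,
# prosody and all, is fed back to it as audio context for a second pass.
# Everything here is hypothetical scaffolding, not a real client library.

from dataclasses import dataclass


@dataclass
class AudioClip:
    """Raw audio plus sample rate; tonal cues and pauses live in the waveform."""
    data: bytes
    sample_rate: int


class MultimodalClient:
    """Hypothetical audio-in/audio-out model interface."""

    def speak(self, prompt: str) -> AudioClip:
        """Generate a spoken first-pass response."""
        raise NotImplementedError

    def listen_and_respond(self, prompt: str, context_audio: AudioClip) -> str:
        """Respond again while attending to prior audio as context."""
        raise NotImplementedError


def audio_rubber_duck(client: MultimodalClient, question: str) -> str:
    # Pass 1: think out loud -- the scratchpad is an audio stream, not text.
    draft = client.speak(f"Think through this out loud before answering: {question}")
    # Pass 2: re-listen to that draft, tonal cues included, then answer.
    return client.listen_and_respond(
        f"Here is your earlier spoken reasoning; revise it and answer: {question}",
        context_audio=draft,
    )
```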
I think this will be the case, and suspect the various firms dedicating themselves to virtualized human avatars will accidentally stumble into profitable niches—not for providing humans virtual AI clones as an interface, but for providing AIs virtual human clones as an interface. (Which is a bit frustrating, as I really loathe that market segment right now.)
When I think about how Sci-Fi authors projected the future of AI cross- or self-talk, it was towards a super-efficient beeping or binary transmission of pure data betwixt them.
But I increasingly get the sense that, like much of actual AI development over the past few years, a lot of the Sci-Fi thinking was tangential or inverse to the actual vector of progress, particularly in underestimating the inherent value humans bring to bear. The wonders we see developing around us are jumpstarted and continually enabled by the patterns we ourselves have woven, and it seems that, at least in the near future, models will conform to those patterns more and more, not less and less.
Still, it’s going to be bizarre as heck to watch a multimodal model’s avatar debating itself aloud like I do in my kitchen...
When I wrote this, I thought OAI was sort of fudging the audio output, using SSML as an intermediate step.
After seeing details in the system card, such as the model copying the user’s voice, it’s clearly not fudging.
Which makes me even more sure the above is going to end up prophetically correct.