That is pretty much how text-to-image and multimodal language models work now, really. You get a giant inscrutable vector embedding from one modality's model, and another giant inscrutable model spits out the corresponding output in a different modality. It seems to work well for all the usual tasks like captioning or question-answering...
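For what it's worth, the pattern looks roughly like this in code. This is a minimal toy sketch of the "embedding handed between two models" idea, not any real system; every module name and dimension here is made up for illustration:

```python
import torch
import torch.nn as nn

# Toy sketch: an image encoder produces an opaque embedding, a projection
# bridges it into the text model's space, and a text decoder produces
# token logits from it. All names/sizes here are hypothetical.

class TinyImageEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, embed_dim))

    def forward(self, images):       # (batch, 3, 32, 32)
        return self.net(images)      # (batch, embed_dim) -- the "giant inscrutable vector"

class TinyTextDecoder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(embed_dim, embed_dim)  # bridge between the two models
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, image_embedding):
        h = torch.tanh(self.proj(image_embedding))
        return self.head(h)          # logits over caption tokens (vastly simplified)

encoder = TinyImageEncoder()
decoder = TinyTextDecoder()
images = torch.randn(1, 3, 32, 32)   # stand-in for a real image batch
logits = decoder(encoder(images))
print(logits.shape)                  # torch.Size([1, 1000])
```

The point is just that nothing human-readable ever crosses the boundary: all the "meaning" rides in that intermediate vector, which neither side can explain.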