Does GPT-4 directly handle the image input or is it converted to text by a separate model then fed into GPT-4?
GPT-4 directly handles the image input. Transformers in general are quite flexible in what data they handle, but it may not have been trained to generate (or be good at generating) image data.
Out of genuine curiosity, can you link to your sources?
https://platform.openai.com/docs/guides/vision and https://openai.com/contributions/gpt-4v are good places to start. https://arxiv.org/abs/2303.08774 is specific in the abstract that the model “can accept image and text inputs and produce text outputs”.
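For concreteness, the request shape from the vision guide looks roughly like this: the same GPT-4 model takes text and image parts in a single message and returns text only (a minimal sketch; the model name and image URL here are placeholders, not a recommendation):

```python
from openai import OpenAI

client = OpenAI()

# One user message containing both a text part and an image part.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder; use whichever vision-capable model you have access to
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)

# The output is text; the model does not return image data.
print(response.choices[0].message.content)
```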
… I'm not certain of the best place to start with multimodal transformers in general. Transformers can work with all kinds of data, and there's a variety of approaches to multimodality.
Edit: This one (https://arxiv.org/abs/2304.08485), which gets into the weeds of implementation, does in a sense glue two models together and train them from there; but it's not so much connecting different models as mapping image data to language embeddings. (And they are the same model.)
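To make "mapping image data to language embeddings" concrete, here's a toy sketch of that kind of connector: a learned projection from vision-encoder features into the LLM's token-embedding space, so image patches get consumed as if they were (soft) language tokens. The class name and dimensions are mine for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class ImageToLanguageProjector(nn.Module):
    """Maps frozen vision-encoder patch features into the LLM's embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim), e.g. from a CLIP ViT
        return self.proj(patch_features)  # -> (batch, num_patches, llm_dim)


# Illustrative usage: project stand-in image features and prepend them to text embeddings.
projector = ImageToLanguageProjector()
image_features = torch.randn(1, 256, 1024)   # stand-in for vision-encoder output
text_embeddings = torch.randn(1, 32, 4096)   # stand-in for the LLM's token embeddings
visual_tokens = projector(image_features)
llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)
print(llm_input.shape)  # torch.Size([1, 288, 4096])
```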
Regarding the visual instruction tuning paper, see Table 5 of https://arxiv.org/pdf/2402.11349.pdf. Though this experiment on multi-modality was rather simple, I think it does show that multi-modal training is not a convenient way to improve on H-Test.
Yeah; I do wonder just how qualitatively different GPT-4's or Gemini's multimodality is from the "glue a vision encoder on, then train it" method LLaVA uses, since I don't think we have specifics. I suspect they trained on image data from the start, or near it, rather than gluing two different transformers together, but it's hard to be sure.