Ann comments on Language Models Don’t Learn the Physical Manifestation of Language

Ann 23 Feb 2024 12:46 UTC
2 points
1
GPT-4 has vision multimodality in the sense that it can take image input, but it uses DALL-E for image generation.
- Chris_Leong 23 Feb 2024 14:56 UTC
  2 points
  0
  Does GPT-4 directly handle the image input or is it converted to text by a separate model then fed into GPT-4?
  - Ann 23 Feb 2024 16:33 UTC
    3 points
    0
    Directly handles the image input. Transformers in general are quite flexible in what data they handle, but it may not have been trained to generate (or be good at generating) image data.
    - Bruce W. Lee 23 Feb 2024 17:06 UTC
      1 point
      0
      Out of genuine curiosity, can you link to your sources?
      - Ann 23 Feb 2024 17:30 UTC
        1 point
        0
        https://platform.openai.com/docs/guides/vision and https://openai.com/contributions/gpt-4v are good places to start. https://arxiv.org/abs/2303.08774 is specific in the abstract that the model “can accept image and text inputs and produce text outputs”.
        
        … Not certain of the best place to start with multimodal transformers in general. Transformers can work with all kinds of data, and there’s a variety of approaches to multimodality.
        
        Edit: This one (https://arxiv.org/abs/2304.08485), which gets into the weeds of implementation, does seem to in a sense glue two models together and train them from there; but it’s not so much connecting different models as mapping image data to language embeddings. (And they are the same model.)
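
To make "mapping image data to language embeddings" a bit more concrete, here is a minimal toy sketch of the general idea: features from a vision encoder are passed through a trainable projection so they have the same width as the language model's token embeddings, and the result is fed in alongside the text tokens. Everything here is an assumption made for illustration: the helper names (`make_matrix`, `project`, `build_input_sequence`), the tiny dimensions, and the random numbers standing in for real encoder outputs and learned weights. It is not the linked paper's code, and it says nothing about how GPT-4 is built.

```python
import random


def make_matrix(rows, cols, rng):
    """Random matrix as a list of rows; a stand-in for encoder outputs or learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(features, weights):
    """Multiply (n x d_in) features by (d_in x d_out) weights."""
    return [
        [sum(f * w for f, w in zip(row, col)) for col in zip(*weights)]
        for row in features
    ]


def build_input_sequence(image_features, text_embeddings, projection):
    """Project image features to the text embedding width, then prepend them to the text tokens."""
    image_tokens = project(image_features, projection)
    return image_tokens + text_embeddings


rng = random.Random(0)
D_VISION, D_MODEL = 6, 4   # toy sizes, not real model dimensions
N_PATCHES, N_TEXT = 3, 5   # hypothetical number of image patches and text tokens

image_features = make_matrix(N_PATCHES, D_VISION, rng)  # pretend vision-encoder output
text_embeddings = make_matrix(N_TEXT, D_MODEL, rng)     # pretend token embeddings
projection = make_matrix(D_VISION, D_MODEL, rng)        # the trainable "glue" layer

sequence = build_input_sequence(image_features, text_embeddings, projection)
print(len(sequence), len(sequence[0]))  # sequence length, embedding width
```

In this toy setup the language model would see one sequence of eight vectors of width four: three derived from the image and five from the text. The only new piece connecting the two sides is the projection matrix, which is why this reads less like wiring two separate models together and more like translating image features into the language model's input space.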
        Bruce W. Lee 23 Feb 2024 18:14 UTC
        1 point
        0
         Regarding the visual instruction tuning paper, see (https://arxiv.org/pdf/2402.11349.pdf, Table 5). Though this experiment on multi-modality was rather simple, I think it does show that adding multi-modality is not a convenient way to improve on H-Test.
        Ann 23 Feb 2024 20:12 UTC
        1 point
        0
         Yeah; I do wonder just how qualitatively different GPT-4’s or Gemini’s multimodality is from the ‘glue a vision classifier on then train it’ method LLaVA uses, since I don’t think we have specifics. I suspect they were trained on image data from the start or near it rather than by gluing two different transformers together, but it’s hard to be sure.