Bruce W. Lee comments on Language Models Don’t Learn the Physical Manifestation of Language

Bruce W. Lee 23 Feb 2024 17:06 UTC
1 point
0
Out of genuine curiosity, can you link to your sources?
- Ann 23 Feb 2024 17:30 UTC
  1 point
  0
  https://platform.openai.com/docs/guides/vision and https://openai.com/contributions/gpt-4v are good places to start. https://arxiv.org/abs/2303.08774 is specific in the abstract that the model “can accept image and text inputs and produce text outputs”.
  
  … I’m not certain of the best place to start with multimodal transformers in general. Transformers can work with all kinds of data, and there is a variety of approaches to multimodality.
  
  Edit: This one (https://arxiv.org/abs/2304.08485), which gets into the weeds of implementation, does seem to, in a sense, glue two models together and train them from there; but it’s not so much connecting different models as mapping image data to language embeddings. (And they are the same model.)
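
A rough sketch of what "mapping image data to language embeddings" can look like: features from a vision encoder are passed through a learned projection so they have the same width as the language model's token embeddings, then placed in the same input sequence as the text. Everything here is an illustrative assumption (the dimensions, the random stand-in features, the helper names `random_matrix` and `project`); it is not the code or the configuration from the linked paper.

```python
import random

random.seed(0)

# Illustrative sizes only; these are assumptions, not values from any paper.
NUM_PATCHES, VISION_DIM, LLM_DIM, NUM_TEXT_TOKENS = 4, 8, 16, 3


def random_matrix(rows, cols):
    """Hypothetical stand-in for learned weights or encoder outputs."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(patch_features, weight):
    """Linearly map each vision feature vector into the LLM embedding width."""
    return [
        [sum(x * w for x, w in zip(patch, column)) for column in zip(*weight)]
        for patch in patch_features
    ]


# Stand-ins for vision-encoder output and for text token embeddings.
patch_features = random_matrix(NUM_PATCHES, VISION_DIM)
text_embeddings = random_matrix(NUM_TEXT_TOKENS, LLM_DIM)

# The projection is the trainable piece that connects the two components.
projection_weight = random_matrix(VISION_DIM, LLM_DIM)

image_tokens = project(patch_features, projection_weight)

# Image "tokens" and text tokens now share one sequence for the language model.
llm_input = image_tokens + text_embeddings
print(len(llm_input), len(llm_input[0]))  # 7 16
```

The point of the sketch is only that, after projection, the language model sees image patches as extra embedding vectors in its input sequence rather than as a separate model's output.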
  - Bruce W. Lee 23 Feb 2024 18:14 UTC
    1 point
    0
    Regarding the visual instruction tuning paper, see Table 5 of https://arxiv.org/pdf/2402.11349.pdf. Though this experiment on multimodality was rather simple, I think it does show that this kind of multimodal training is not a convenient route to improving on H-Test.
    - Ann 23 Feb 2024 20:12 UTC
      1 point
      0
      Yeah; I do wonder just how qualitatively different GPT-4’s or Gemini’s multimodality is from the ‘glue a vision classifier on, then train it’ method LLaVA uses, since I don’t think we have specifics. I suspect it was trained on image data from the start, or near it, rather than by gluing two different transformers together, but it’s hard to be sure.