Regarding the visual instruction tuning paper, see (https://arxiv.org/pdf/2402.11349.pdf, Table 5). Though this experiment on multi-modality was rather simple, I think it does show that adding multi-modality is not a convenient way to improve on H-Test.
Yeah; I do wonder just how qualitatively different GPT-4 or Gemini's multimodality is from the 'glue a vision encoder on, then train it' method LLaVA uses, since I don't think we have specifics. I suspect they trained on image data from the start, or near it, rather than gluing two different transformers together, but it's hard to be sure.
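For anyone unfamiliar with what that 'glue a vision encoder on' design looks like in practice, here's a minimal sketch of a LLaVA-like setup. Everything below (module names, dimensions, helpers) is illustrative, not the actual LLaVA code: the idea is just that a frozen vision encoder's patch features get mapped through a small learned projection into the LLM's token-embedding space and prepended as "visual tokens".

```python
# Minimal sketch of the LLaVA-style "glue a vision encoder onto an LLM" design.
# All names and dimensions are illustrative assumptions, not LLaVA's real code.

import torch
import torch.nn as nn


class VisionToLLMAdapter(nn.Module):
    """Learned projection from vision-feature space into LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen encoder
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)


def build_multimodal_inputs(
    patch_features: torch.Tensor,   # stand-in for frozen vision-encoder output
    text_embeddings: torch.Tensor,  # stand-in for the LLM's text-token embeddings
    adapter: VisionToLLMAdapter,
) -> torch.Tensor:
    """Prepend projected visual tokens to the text embeddings."""
    visual_tokens = adapter(patch_features)
    return torch.cat([visual_tokens, text_embeddings], dim=1)


if __name__ == "__main__":
    adapter = VisionToLLMAdapter()
    fake_patches = torch.randn(1, 256, 1024)  # e.g. CLIP-ViT-style patch features
    fake_text = torch.randn(1, 32, 4096)      # e.g. embedded prompt tokens
    out = build_multimodal_inputs(fake_patches, fake_text, adapter)
    print(out.shape)  # torch.Size([1, 288, 4096])
```

The contrast the comment is drawing is between this kind of late fusion (only the projection, and perhaps the LLM, get trained on image-text pairs) and training the whole model on image data from early on.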