Ann comments on Language Models Don’t Learn the Physical Manifestation of Language

Ann 23 Feb 2024 20:12 UTC
1 point
Yeah; I do wonder just how qualitatively different GPT-4’s or Gemini’s multimodality is from the ‘glue a vision encoder on, then train it’ method LLaVA uses, since I don’t think we have specifics. I suspect they were trained on image data from the start or near it, rather than by gluing two different transformers together, but it’s hard to be sure.