Yeah; I do wonder just how qualitatively different GPT4 or Gemini’s multimodality is from the ‘glue a vision classifier on then train it’ method LLaVa uses, since I don’t think we have specifics. Suspect it trained on image data from the start or near it rather than gluing two different transformers together, but hard to be sure.
Yeah; I do wonder just how qualitatively different GPT4 or Gemini’s multimodality is from the ‘glue a vision classifier on then train it’ method LLaVa uses, since I don’t think we have specifics. Suspect it trained on image data from the start or near it rather than gluing two different transformers together, but hard to be sure.