I would expect that they fare much better with a text representation. I’m not too familiar with how multimodality works exactly, but I assume that “vision” works very differently from our intuitive understanding of it. When we are asked such a question, we look at the image and start scanning it with the problem in mind. Transformers, on the other hand, seem to only have a rather vague “conceptual summary” of the image available: lots of detail, but not necessarily the detail needed for any given question, and they then have to work with that limited representation.
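To make my mental model concrete (take it with a grain of salt, since I’m not sure this matches how any specific model actually does it): my understanding is that ViT-style encoders chop the image into fixed-size patches and hand the language model a fixed set of embedding vectors, computed once up front, before the question is even considered. A rough sketch of that idea, with the patch size and embedding width just being typical placeholder values:

    # Rough sketch, not any particular model: a ViT-style encoder cuts the
    # image into fixed-size patches and embeds each one as a token. The
    # language model then only ever sees these tokens, not the raw pixels.
    import torch
    import torch.nn as nn

    patch_size = 16      # assumed typical patch size
    embed_dim = 768      # assumed typical embedding width
    image = torch.randn(1, 3, 224, 224)  # dummy RGB image

    # Patch embedding is just a conv whose stride equals its kernel size.
    patch_embed = nn.Conv2d(3, embed_dim,
                            kernel_size=patch_size, stride=patch_size)

    tokens = patch_embed(image)                 # (1, 768, 14, 14)
    tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768)

    print(tokens.shape)  # 196 fixed vectors: the "summary" the LLM works from

If that’s roughly right, the model can’t go back and “re-look” at a region the way we do; it can only attend over those precomputed tokens.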
Maybe somebody more knowledgeable can comment on how accurate that is, and on whether we can expect scaling to eventually solve this problem or whether some different mitigation will be needed.