My first guess is that I would prefer just calling them “Multimodals”. Or perhaps “Image/Text Multimodals”.