Okay, this kind of metric could maybe work. The metric could be
the sum of the performance of each layer + a regularization function of the size of the text, proportional to the index of the layer.
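As a rough illustration, here is a minimal sketch of what that score could look like. The names `layer_perfs`, `layer_texts`, and `lam` are hypothetical, and treating the regularization as a subtracted penalty is one possible reading of the proposal:

```python
def layerwise_score(layer_perfs, layer_texts, lam=0.01):
    """Sum of per-layer performance minus a text-size penalty that grows
    with the layer index (one possible reading of the proposed metric)."""
    score = 0.0
    for i, (perf, text) in enumerate(zip(layer_perfs, layer_texts), start=1):
        score += perf                  # performance of layer i, e.g. a captioning score
        score -= lam * i * len(text)   # regularization on text size, scaled by layer index
    return score
```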
I’m not super familiar with those kinds of image-to-text models. Can you provide an example of a dataset or a GitHub model doing this?
Yes, of course:
Models:
https://paperswithcode.com/task/image-captioning
Datasets:
LAION-400M (or other sizes): https://laion.ai/blog/laion-400-open-dataset/
https://paperswithcode.com/dataset/coco-captions
ImageNet or any image classification dataset: just treat the labels as text. This should be used sparingly, as otherwise the model will learn to just output single words.
Also, in the performance metric, the sum of the per-layer performances should probably be weighted to give less importance to the initial layers; otherwise we encourage the model to do as much of the work as possible at the start instead of improving gradually.
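One simple way to do that weighting (purely illustrative; the linear schedule and the function names are assumptions, not something fixed yet):

```python
def layer_weights(num_layers):
    """Linearly increasing weights so early layers count less (one possible schedule)."""
    raw = list(range(1, num_layers + 1))   # 1, 2, ..., L
    total = sum(raw)
    return [w / total for w in raw]        # normalized to sum to 1

def weighted_layer_performance(layer_perfs):
    """Weighted sum of per-layer performances using the schedule above."""
    weights = layer_weights(len(layer_perfs))
    return sum(w * p for w, p in zip(weights, layer_perfs))
```

With this kind of schedule, the last layer contributes the most to the summed score, so the model is not rewarded for front-loading all of the work into the first few layers.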