Image-to-text model with successive refinements:
For example, given the image above, the “first” layer of the network outputs “city”, the second one outputs “city with vegetation”, the third one “view of a city with vegetation from a balcony”, and the fourth one “view of a city with skyscrapers in the background and with vegetation, from a balcony”.
This could be done by starting with a blank description and repeatedly applying a “detailer” network that adds details to the current description, given the image.
This should help interpretability, and thus safety, because you are dividing the very complex task of image description into iterations of a simpler task: improving an already existing description. It would also let you see exactly where the model went wrong when the output is wrong. It might also be possible to share weights between the layers to further simplify the model.
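Something like this, as a rough sketch (assuming a PyTorch-style setup; all the class and function names here are made up for illustration, not an existing API):

```python
import torch
import torch.nn as nn

class DetailerBlock(nn.Module):
    """One refinement step: reads the image features and the current
    description embedding, and returns an updated description embedding."""
    def __init__(self, dim):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, desc_tokens, image_tokens):
        # Attend from the current description to the image, then refine it.
        attended, _ = self.cross_attn(desc_tokens, image_tokens, image_tokens)
        desc_tokens = self.norm1(desc_tokens + attended)
        desc_tokens = self.norm2(desc_tokens + self.mlp(desc_tokens))
        return desc_tokens

class RefinementCaptioner(nn.Module):
    """Starts from a 'blank' description and applies the same detailer
    block n_steps times (weights shared across iterations)."""
    def __init__(self, dim=512, n_steps=10, desc_len=32):
        super().__init__()
        self.detailer = DetailerBlock(dim)            # shared weights
        self.blank_desc = nn.Parameter(torch.zeros(1, desc_len, dim))
        self.n_steps = n_steps

    def forward(self, image_tokens):
        desc = self.blank_desc.expand(image_tokens.size(0), -1, -1)
        intermediates = []
        for _ in range(self.n_steps):
            desc = self.detailer(desc, image_tokens)
            intermediates.append(desc)                # keep every step for inspection
        return desc, intermediates                    # each step can be decoded to text elsewhere
```

Keeping the intermediate embeddings is what makes the failure point inspectable: each step can be decoded back to text and read on its own.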
Thank you. This is a good project idea, but there is no clear metric of performance, so it’s not a good hackathon idea.
The performance metric can be a weighted average of the final caption quality and of how uniformly the description goes from totally random to correct. For example, with 10 refinement models, the optimal score in this category is achieved when each refinement block reduces the distance from the initial random text encoding to the final one by 10% of the original distance. This makes sure the process is in fact gradual, and not that, for example, the last two layers do all the work and everything before them is just the identity. Maybe it should not be a linear scale but a logarithmic one, because the final refinements might be harder to do than the initial ones.
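A rough sketch of how that gradualness term could be computed, assuming each step’s output has already been encoded into a text-embedding vector (the function name and the use of Euclidean distance are assumptions, not a fixed choice):

```python
import numpy as np

def gradualness_score(step_embeddings, target_embedding):
    """step_embeddings: per-step text encodings, from the initial random vector
    to the final output. Returns 1.0 when each step removes an equal fraction
    of the original distance to the target (perfectly gradual), and lower
    values when a few steps do all the work."""
    start = step_embeddings[0]
    total = np.linalg.norm(start - target_embedding)
    n_steps = len(step_embeddings) - 1
    # Ideal linear schedule: after step k, remaining distance is (1 - k/n) of the original.
    # This could be swapped for a logarithmic schedule, as suggested above.
    ideal = np.array([(1 - k / n_steps) * total for k in range(n_steps + 1)])
    actual = np.array([np.linalg.norm(e - target_embedding) for e in step_embeddings])
    deviation = np.mean(np.abs(actual - ideal)) / (total + 1e-8)
    return max(0.0, 1.0 - deviation)
```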
Okay, this kind of metric could maybe work. The metric could be
sum of the performance of each layer + a regularization term on the size of the text, proportional to the index of the layer.
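Roughly, in code (assuming per-layer caption quality scores have already been computed somehow, e.g. with CIDEr or CLIP similarity; the names and the shape of the regularization are just one possible choice):

```python
def hackathon_metric(layer_scores, layer_caption_lengths, reg_strength=0.01, tokens_per_layer=8):
    """layer_scores[i]: caption quality of layer i's output (higher is better).
    layer_caption_lengths[i]: number of tokens in layer i's caption.
    The regularization penalises captions that are long relative to how early
    the layer is, so early layers stay terse and later layers add the detail."""
    total = 0.0
    for i, (score, length) in enumerate(zip(layer_scores, layer_caption_lengths), start=1):
        length_budget = i * tokens_per_layer          # allowed length grows with layer index
        penalty = reg_strength * max(0.0, length - length_budget)
        total += score - penalty
    return total
```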
I’m not super familiar with those kinds of image-to-text models. Can you provide an example of a dataset or a GitHub model doing this?
Yes of course:
Models:
https://paperswithcode.com/task/image-captioning
Datasets:
LAION-400M or other sizes: https://laion.ai/blog/laion-400-open-dataset/
https://paperswithcode.com/dataset/coco-captions
ImageNet or any image classification dataset: just treat the labels as text. This should be used sparingly, otherwise the model will learn to just output single words.
Also, in the performance metric, the sum of the per-layer performances should probably be weighted to give less importance to the initial layers; otherwise we encourage the model to do as much of the work as possible at the start instead of being gradual.
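For example, something like this (just a sketch; the linear weighting is one arbitrary choice among many):

```python
def weighted_layer_sum(layer_scores):
    """Later layers count more, so a model that front-loads all the detail
    gains little over one that refines gradually."""
    n = len(layer_scores)
    weights = [i / n for i in range(1, n + 1)]        # layer 1 -> 1/n, layer n -> 1
    return sum(w * s for w, s in zip(weights, layer_scores)) / sum(weights)
```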