Q: Has anyone trained an AI on video sequences paired with captions (mostly descriptive, either provided directly or generated by another system) so that the resulting system becomes capable of:
+ describing a given scene accurately
+ predicting movements in visual and/or textual form
+ answering questions about the physical/visible world, e.g. "Does a fridge have wheels?" or "Which animals are we most likely to see on a flower?"