I’m picturing something along the lines of “Pick the odd one out from these three 10-second video clips”, where two of the clips are examples from one broad genre (birthday party, tennis match, wildlife, city street, etc.) and the third is from another.
I might be behind the times though, or underestimating the success rate you’d get by classifying based on, say, a still image taken from one random frame of each video.
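For concreteness, that single-random-frame baseline would look roughly like this — a rough sketch, assuming OpenCV for grabbing the frame and an off-the-shelf torchvision ResNet standing in for whatever genre classifier you’d actually use (the clip path and the ImageNet labels are just placeholders):

```python
# Sketch of the "classify one random frame" baseline.
# Assumes opencv-python, torch, torchvision, Pillow are installed;
# "clip.mp4" is a hypothetical path to one of the 10-second clips.
import random

import cv2
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

def classify_random_frame(video_path: str) -> str:
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, random.randrange(max(n_frames, 1)))
    ok, frame_bgr = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"could not read a frame from {video_path}")

    # OpenCV returns BGR; the torchvision transforms expect an RGB PIL image.
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    image = Image.fromarray(frame_rgb)

    weights = ResNet50_Weights.DEFAULT
    model = resnet50(weights=weights).eval()
    batch = weights.transforms()(image).unsqueeze(0)

    with torch.no_grad():
        logits = model(batch)
    # ImageNet categories stand in for "broad genre" labels here.
    return weights.meta["categories"][logits.argmax().item()]

print(classify_random_frame("clip.mp4"))  # hypothetical clip path
```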
But maybe it would still work if you added static noise to heavily obscure the videos, relying on the human ability to infer missing details and fill in noisy visual input.
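To make the “add static” idea concrete, here’s a minimal sketch of blending each frame with uniform noise — the 0.6 mix ratio is an arbitrary placeholder, not some tuned “humans pass, bots fail” point, and frames are assumed to be the usual H×W×3 uint8 arrays:

```python
import numpy as np

def add_static(frame, noise_level=0.6, rng=None):
    """Blend a frame (H x W x 3, uint8) with uniform static.

    noise_level is the fraction of each pixel value replaced by noise;
    0.6 is an arbitrary placeholder, not a calibrated threshold.
    """
    rng = rng or np.random.default_rng()
    static = rng.integers(0, 256, size=frame.shape, dtype=np.uint8)
    mixed = ((1.0 - noise_level) * frame.astype(np.float32)
             + noise_level * static.astype(np.float32))
    return np.clip(mixed, 0, 255).astype(np.uint8)
```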
I think “video reasoning” could be an interesting approach, as you say.
Like if there are 10 frames and no single frame shows the whole tennis racket, but if you play them back real fast a human could infer that there’s a tennis racket, because part of the racket is in each frame.
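The machine-side counterpart would presumably be some crude pooling of per-frame evidence, and the test only works if human gestalt across frames beats it. A minimal sketch, assuming you already have per-frame class logits from something like the classifier above:

```python
import torch

def clip_prediction(per_frame_logits):
    """Pool per-frame evidence into one clip-level guess.

    per_frame_logits: tensor of shape (num_frames, num_classes), e.g. one
    row per frame of the 10-frame clip. Averaging softmax scores is the
    naive baseline: it has no notion of how partial views of the racket
    line up across frames, which is exactly the gap the test is betting on.
    """
    probs = torch.softmax(per_frame_logits, dim=1)
    return int(probs.mean(dim=0).argmax())
```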
How does AI do at classifying video these days?