Sure, but they only use 16 frames, which doesn’t really seem like it’s “video” to me.
Understanding video input is an important step towards a useful generalist agent. We measure the
video understanding capability across several established benchmarks that are held-out from training.
These tasks measure whether the model is able to understand and reason over a temporally-related
sequence of frames. For each video task, we sample 16 equally-spaced frames from each video clip
and feed them to the Gemini models. For the YouTube video datasets (all datasets except NextQA
and the Perception Test), we evaluate the Gemini models on videos that were still publicly available
in the month of November, 2023.
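For concreteness, here is a minimal sketch of what "sample 16 equally-spaced frames" typically looks like in practice. The paper doesn't publish code, so the use of OpenCV, the function name, and the exact index selection here are my own assumptions, not Google's pipeline:

```python
import cv2
import numpy as np

def sample_equally_spaced_frames(video_path: str, num_frames: int = 16) -> list:
    """Return num_frames equally-spaced frames (BGR arrays) from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        raise ValueError(f"Could not read frame count from {video_path}")

    # Spread the indices evenly across the clip, including the first and last frame.
    indices = np.linspace(0, total - 1, num_frames, dtype=int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# Example: frames = sample_equally_spaced_frames("clip.mp4")  # 16 frames by default
```

So for a one-minute clip at 30 fps (~1,800 frames), the model sees roughly one frame every four seconds, which is what the "is 16 frames really video?" objection is about.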