Sure, but they only use 16 frames, which doesn’t really seem like it’s “video” to me.
Understanding video input is an important step towards a useful generalist agent. We measure the
video understanding capability across several established benchmarks that are held-out from training.
These tasks measure whether the model is able to understand and reason over a temporally-related
sequence of frames. For each video task, we sample 16 equally-spaced frames from each video clip
and feed them to the Gemini models. For the YouTube video datasets (all datasets except NextQA
and the Perception Test), we evaluate the Gemini models on videos that were still publicly available
in the month of November, 2023.
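For concreteness, here is a minimal sketch of what "sample 16 equally-spaced frames" typically looks like in practice. The paper doesn't publish code, so the use of OpenCV, the function name, and the exact index selection here are my own assumptions, not Google's pipeline:

```python
import cv2
import numpy as np

def sample_equally_spaced_frames(video_path: str, num_frames: int = 16) -> list:
    """Return num_frames equally-spaced frames (BGR arrays) from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        raise ValueError(f"Could not read frame count from {video_path}")

    # Spread the indices evenly across the clip, including the first and last frame.
    indices = np.linspace(0, total - 1, num_frames, dtype=int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# Example: frames = sample_equally_spaced_frames("clip.mp4")  # 16 frames by default
```

So for a one-minute clip at 30 fps (~1,800 frames), the model sees roughly one frame every four seconds, which is what the "is 16 frames really video?" objection is about.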