Update: it seems that the video capability is accomplished just by feeding still frames of the video into the model, not by any native video generation.
A sequence of still frames is a video. If the model was trained on ordered sequences of still frames crammed into the context window, as the technical report claims, then it understands video natively. And it would be surprising if it didn't also have some capability for generating video. I'm not sure why audio/video generation isn't mentioned; perhaps the performance in these areas is not competitive with other models.
Sure, but they only use 16 frames, which doesn’t really seem like it’s “video” to me.
Understanding video input is an important step towards a useful generalist agent. We measure the
video understanding capability across several established benchmarks that are held-out from training.
These tasks measure whether the model is able to understand and reason over a temporally-related
sequence of frames. For each video task, we sample 16 equally-spaced frames from each video clip
and feed them to the Gemini models. For the YouTube video datasets (all datasets except NextQA
and the Perception test), we evaluate the Gemini models on videos that were still publicly available
in the month of November, 2023.
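The sampling scheme the report describes (16 equally spaced frames per clip) can be sketched roughly as follows. This is a hypothetical helper, not the report's actual code, and it assumes the simplest interpretation: evenly spaced indices spanning the whole clip.

```python
# Pick 16 equally spaced frame indices from a clip of num_frames frames.
# (Illustrative sketch; the technical report does not publish its sampling code.)
def sample_frame_indices(num_frames: int, num_samples: int = 16) -> list[int]:
    if num_frames <= num_samples:
        # Short clip: just take every frame.
        return list(range(num_frames))
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]

# A 10-second clip at 30 fps has 300 frames:
print(sample_frame_indices(300))  # 16 indices from 0 to 299
```

Note that a fixed 16-frame budget means the effective sampling rate falls as clips get longer: a 10-second clip is sampled at ~1.6 fps, a 2-minute clip at ~0.13 fps.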
Video can get extremely expensive without specific architectural support. E.g., a folder of still images takes up >10x the space of the equivalent compressed video, and at, say, 1000 tokens per frame and 30 frames/second, full-rate video is a lot of compute.
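The back-of-the-envelope arithmetic behind this (using the illustrative numbers above; 1000 tokens/frame is an assumption, not a published figure):

```python
# Rough token cost of full-rate video vs. the 16-frame sampling scheme.
# All figures are illustrative assumptions from the comment above.
tokens_per_frame = 1000
fps = 30

full_rate_minute = tokens_per_frame * fps * 60   # every frame of a one-minute clip
sampled_clip = tokens_per_frame * 16             # 16 sampled frames per clip

print(full_rate_minute)  # 1800000
print(sampled_clip)      # 16000
```

So one minute of unsampled video would cost over a hundred times the tokens of the 16-frame scheme, which makes the economic motivation for aggressive frame sampling clear.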