sludgepuddle comments on Google Gemini Announced

sludgepuddle 6 Dec 2023 17:11 UTC
10 points
3
A sequence of still frames is a video. If the model was trained on ordered sequences of still frames crammed into the context window, as the technical report claims, then it understands video natively. And it would be surprising if it didn’t also have some capability for generating video. I’m not sure why audio/video generation isn’t mentioned; perhaps the performance in these arenas is not competitive with other models.
- Jacob G-W 6 Dec 2023 18:34 UTC
  12 points
  0
  Sure, but they only use 16 frames, which doesn’t really seem like it’s “video” to me.
  
  Understanding video input is an important step towards a useful generalist agent. We measure the video understanding capability across several established benchmarks that are held-out from training. These tasks measure whether the model is able to understand and reason over a temporally-related sequence of frames. For each video task, we sample 16 equally-spaced frames from each video clip and feed them to the Gemini models. For the YouTube video datasets (all datasets except NextQA and the Perception test), we evaluate the Gemini models on videos that were still publicly available in the month of November, 2023.
- Tao Lin 6 Dec 2023 23:31 UTC
  6 points
  0
  Video can get extremely expensive without specific architectural support. For example, a folder of images takes up >10x the space of the equivalent video, and at, say, 1000 tokens per frame and 30 frames/second, that is 30,000 tokens for every second of footage, which is a lot of compute.
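
To put rough numbers on the two points above (16 sampled frames versus full frame-rate video), here is a minimal Python sketch. Everything in it is an assumption for illustration: the 1000 tokens per frame is the hypothetical figure from the comment above, not Gemini's real tokenization, and the report only says "equally-spaced", so taking the midpoint of each segment is just one plausible reading. The function names are made up for this example.

```python
def sample_frame_indices(total_frames: int, num_samples: int = 16) -> list[int]:
    """Pick num_samples roughly equally spaced frame indices.

    Assumption: the clip is split into num_samples equal segments and the
    frame nearest the middle of each segment is taken.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Clip is shorter than the sample budget: use every frame.
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * (i + 0.5)) for i in range(num_samples)]


def tokens_for_full_clip(seconds: float, fps: float = 30.0,
                         tokens_per_frame: int = 1000) -> int:
    """Token count if every frame of the clip goes into the context window."""
    return round(seconds * fps * tokens_per_frame)


def tokens_for_sampled_clip(num_samples: int = 16,
                            tokens_per_frame: int = 1000) -> int:
    """Token count if only a fixed number of sampled frames is fed in."""
    return num_samples * tokens_per_frame


if __name__ == "__main__":
    one_minute_frames = 60 * 30  # one minute of 30 fps video
    print(sample_frame_indices(one_minute_frames))
    print(tokens_for_full_clip(60))      # every frame of a one-minute clip
    print(tokens_for_sampled_clip())     # 16 sampled frames only
```

Under these assumed numbers, a one-minute clip costs 1,800,000 tokens at full frame rate versus 16,000 tokens for 16 sampled frames, which is why fixed-budget sampling is cheap but also why it arguably isn't "video" in the usual sense.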