Update: it seems that the video capability is accomplished just by feeding still frames of the video into the model, not by any native video generation.
A sequence of still frames is a video. If the model was trained on ordered sequences of still frames crammed into the context window, as the technical report claims, then it understands video natively. And it would be surprising if it didn't also have some capability for generating video. I'm not sure why audio/video generation isn't mentioned; perhaps the performance in these areas is not competitive with other models.
Sure, but they only use 16 frames, which doesn’t really seem like it’s “video” to me.
Understanding video input is an important step towards a useful generalist agent. We measure the
video understanding capability across several established benchmarks that are held-out from training.
These tasks measure whether the model is able to understand and reason over a temporally-related
sequence of frames. For each video task, we sample 16 equally-spaced frames from each video clip
and feed them to the Gemini models. For the YouTube video datasets (all datasets except NextQA
and the Perception test), we evaluate the Gemini models on videos that were still publicly available
in the month of November, 2023.
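The sampling scheme the report describes (16 equally spaced frames per clip) can be sketched roughly as follows. This is a hypothetical helper, not the report's actual code, and it assumes the simplest interpretation: evenly spaced indices spanning the whole clip.

```python
# Pick 16 equally spaced frame indices from a clip of num_frames frames.
# (Illustrative sketch; the technical report does not publish its sampling code.)
def sample_frame_indices(num_frames: int, num_samples: int = 16) -> list[int]:
    if num_frames <= num_samples:
        # Short clip: just take every frame.
        return list(range(num_frames))
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]

# A 10-second clip at 30 fps has 300 frames:
print(sample_frame_indices(300))  # 16 indices from 0 to 299
```

Note that a fixed 16-frame budget means the effective sampling rate falls as clips get longer: a 10-second clip is sampled at ~1.6 fps, a 2-minute clip at ~0.13 fps.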
Video can get extremely expensive without specific architectural support. E.g., a folder of still images takes up >10x the space of the equivalent compressed video, and at, say, 1000 tokens per frame and 30 frames/second, full-rate video is a lot of compute.
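The back-of-the-envelope arithmetic behind this (using the illustrative numbers above; 1000 tokens/frame is an assumption, not a published figure):

```python
# Rough token cost of full-rate video vs. the 16-frame sampling scheme.
# All figures are illustrative assumptions from the comment above.
tokens_per_frame = 1000
fps = 30

full_rate_minute = tokens_per_frame * fps * 60   # every frame of a one-minute clip
sampled_clip = tokens_per_frame * 16             # 16 sampled frames per clip

print(full_rate_minute)  # 1800000
print(sampled_clip)      # 16000
```

So one minute of unsampled video would cost over a hundred times the tokens of the 16-frame scheme, which makes the economic motivation for aggressive frame sampling clear.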