At first, I assumed that the demo video, which is also sped up, was in fact only using single frames, but there is a scene where the guy is shuffling three cups and Gemini correctly figures out where the coin went. So it seems it can do video—at least in the sense of using a sequence of frames.
I think the video is mostly faked: it is a sequence of things Gemini can only sort of do. In the blog post they do it with few-shot prompting and 3 screenshots, and say Gemini sometimes gets it wrong:
https://developers.googleblog.com/2023/12/how-its-made-gemini-multimodal-prompting.html?m=1