From Simon Willison’s Weblog it seems to be about 260 tokens per frame, where each frame comes from one second of video, and each frame is processed the same as any other image:
it looks like it really does work by breaking down the video into individual frames and processing each one as an image.
And at the end:
The image input was 258 tokens, the total token count after the response was 410 tokens—so 152 tokens for the response from the model. Those image tokens pack in a lot of information!
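For a rough sense of what this per-frame cost implies for whole videos, here is a back-of-envelope sketch in Python. It assumes the ~258 tokens per frame observed above and the 1 FPS sampling described below hold in general; the numbers are estimates, not anything documented by Google.

```python
# Back-of-envelope token budget for video input, assuming ~258 tokens per
# frame (from Simon Willison's test) and 1 frame sampled per second
# (per the Gemini 1.5 technical report).
TOKENS_PER_FRAME = 258
FRAMES_PER_SECOND = 1

def video_token_estimate(duration_seconds: int) -> int:
    """Rough estimate of input tokens for a video of the given length."""
    return duration_seconds * FRAMES_PER_SECOND * TOKENS_PER_FRAME

# One hour of video works out to roughly 929k tokens, which lands close to
# Gemini 1.5 Pro's advertised 1M-token context window.
print(video_token_estimate(60 * 60))  # 928800
```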
But these 152 tokens are just the titles and authors of the books. Information about the order, size, colors, textures, etc. of each book and the other objects on the bookshelf is likely (mostly) all extractable by Gemini. Which would mean that image tokens are much more information dense than regular text tokens.
Someone should probably test how well Gemini 1.5 Pro handles lots of images that contain only text, as well as images with varying levels of text density, to see whether it’s as simple as rendering your text into an image in order to effectively “extend” Gemini’s context window, though likely with a significant drop in performance.
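A minimal version of that experiment might look like the sketch below. It assumes the google-generativeai Python SDK and Pillow; the model name, API key, image size, and prompt are placeholders, and count_tokens is used to compare what the same passage costs as plain text versus rendered into an image.

```python
# Sketch of the text-in-image experiment described above. Assumes the
# google-generativeai SDK and Pillow are installed.
import google.generativeai as genai
from PIL import Image, ImageDraw

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-pro-latest")  # assumed model name

passage = "Some long block of text you would normally put in the prompt."

# Render the passage onto a plain white image. Text density is controlled by
# the image size and font; the defaults here are arbitrary and the text is
# not wrapped, so this is only a toy rendering.
img = Image.new("RGB", (1024, 1024), "white")
ImageDraw.Draw(img).text((20, 20), passage, fill="black")

# Compare token costs: the same content as text vs. as an image.
text_tokens = model.count_tokens(passage).total_tokens
image_tokens = model.count_tokens([img]).total_tokens
print(f"text: {text_tokens} tokens, image: {image_tokens} tokens")

# Then check whether the model can actually read the text back out of the image.
response = model.generate_content([img, "Transcribe the text in this image."])
print(response.text)
```

Running this over images with progressively more (and smaller) text would show where, and how sharply, performance drops off.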
Note: I’ve added the clarification: “where each frame comes from one second of a video”
I misread the section that quotes the Gemini 1.5 technical report; the paper clearly states, multiple times, that video frames are sampled at 1 FPS. Presumably, given the simple test that Simon Willison performed, a single frame is taken from each second rather than the entire second of video being tokenized.