What does it mean for perception to compress a frame of video to 1k tokens? What kind of information gets lost when you do this?