Not video transcripts—video. 1 frame of Video contains much more data than 1 text token, and you can train an AI as a next frame predictor much as you can a next token predictor.
Not video transcripts—video. 1 frame of Video contains much more data than 1 text token, and you can train an AI as a next frame predictor much as you can a next token predictor.