When you exhaust all the language data from text, you can start extracting language from audio and video.
As far as I know, the largest public repository of audio and video is YouTube. We can do a rough back-of-the-envelope computation for how much data is in there:
According to some 2019 article I found, 50 hours of video are uploaded to YouTube every minute. If we assume this was the average rate for the last 15 years, that gets us 200 billion minutes of video.
An average conversation has 150 words per minute, according to a Google search. That gets us 30T words, or 30T tokens if we assume 1 token per word (is this right?).
Let’s say 1% of that is actually useful, so that gets us 300B tokens, which is… a lot less than I expected.
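For concreteness, here's that arithmetic as a small Python sketch. The 200 billion minutes figure is taken at face value (the reply below questions that total), and the variable names are mine:

```python
# Back-of-the-envelope estimate using the figures quoted above.
video_minutes = 200e9        # total YouTube video, in minutes (as stated above)
words_per_minute = 150       # average speaking rate, per the Google search
tokens_per_word = 1          # rough assumption: ~1 token per word
useful_fraction = 0.01       # assume only 1% is actually useful

total_tokens = video_minutes * words_per_minute * tokens_per_word   # ~30T tokens
useful_tokens = total_tokens * useful_fraction                      # ~300B tokens

print(f"total: {total_tokens:.1e} tokens, useful: {useful_tokens:.1e} tokens")
# -> total: 3.0e+13 tokens, useful: 3.0e+11 tokens
```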
So it seems like video doesn’t save us, if we just use it for the language data. We could do self-supervised learning on the video data, but for that we need to know the scaling laws for video (has anyone done that?).
Very interesting! There are a few things in the calculation that seem wrong to me:
If I did things right, 15 years * (365 days/yr) * (24 hours/day) * (60 mins/hour) * (50 youtube!hours / min) * (60 youtube!mins / youtube!hour) = 24B youtube!minutes, not 200B (see the sketch below this list).
I’d expect much less than 100% of YouTube video time to contain speech. I don’t know what a reasonable discount for this would be, though.
In the opposite direction, 1% useful seems too low. IIRC, web scrape quality pruning discards less than 99%, and this data is less messy than a web scrape.
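To make the first point concrete, here is the upload-rate arithmetic written out, under the same assumptions as above (50 hours uploaded per real-time minute, held constant over 15 years):

```python
# Recompute total uploaded video from the upload rate.
years = 15
real_minutes = years * 365 * 24 * 60       # real-time minutes in 15 years
uploaded_hours = real_minutes * 50         # youtube!hours uploaded in that time
uploaded_minutes = uploaded_hours * 60     # youtube!minutes uploaded

print(f"{uploaded_minutes:.1e} youtube!minutes")
# -> 2.4e+10 youtube!minutes, i.e. ~24B rather than 200B
```

(Running the same 150 words-per-minute and 1%-useful assumptions on ~24B minutes gives roughly 35B useful tokens rather than 300B.)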
In any case, yeah, this does not seem like a huge amount of data. But there’s enough order-of-magnitude fuzziness in the estimate that it does seem like it’s worth someone’s time to look into more seriously.
I agree that this points in the direction of video becoming increasingly important.
But why assume only 1% is useful? And more importantly, why use only the language data? Even if we don’t have the scaling laws, it seems pretty clear that there’s a ton of information in the non-language parts of videos that’d be useful to a general-purpose agent, almost certainly more than in the language parts. (Of course, it’ll take more computation to extract the same amount of useful information from video than from text.)