When you exhaust all the language data from text, you can start extracting language from audio and video.
As far as I know, the largest public repository of audio and video is YouTube. We can do a rough back-of-the-envelope computation for how much data is in there:
According to some 2019 article I found, 50 hours of video are uploaded to YouTube every minute. If we assume this was the average rate for the last 15 years, that gets us 200 billion minutes of video.
An average conversation has 150 words per minute, according to a Google search. That gets us 30T words, or 30T tokens if we assume 1 token per word (is this right?).
Let’s say 1% of that is actually useful, so that gets us 300B tokens, which is… a lot less than I expected.
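For concreteness, here's that arithmetic as a small Python sketch. The 200 billion minutes figure is taken at face value (the reply below questions that total), and the variable names are mine:

```python
# Back-of-the-envelope estimate using the figures quoted above.
video_minutes = 200e9        # total YouTube video, in minutes (as stated above)
words_per_minute = 150       # average speaking rate, per the Google search
tokens_per_word = 1          # rough assumption: ~1 token per word
useful_fraction = 0.01       # assume only 1% is actually useful

total_tokens = video_minutes * words_per_minute * tokens_per_word   # ~30T tokens
useful_tokens = total_tokens * useful_fraction                      # ~300B tokens

print(f"total: {total_tokens:.1e} tokens, useful: {useful_tokens:.1e} tokens")
# -> total: 3.0e+13 tokens, useful: 3.0e+11 tokens
```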
So it seems like video doesn’t save us, if we just use it for the language data. We could do self-supervised learning on the video data, but for that we need to know the scaling laws for video (has anyone done that?).
Very interesting! There are a few things in the calculation that seem wrong to me:
If I did things right, 15 years * (365 days/yr) * (24 hours/day) * (60 mins/hour) * (50 youtube!hours / min) * (60 youtube!mins / youtube!hour) = 24B youtube!minutes, not 200B (see the sketch below this list).
I’d expect much less than 100% of YouTube video time to contain speech. I don’t know what a reasonable discount for this would be, though.
In the opposite direction, 1% useful seems too low. IIRC, web scrape quality pruning discards less than 99%, and this data is less messy than a web scrape.
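To make the first point concrete, here is the upload-rate arithmetic written out, under the same assumptions as above (50 hours uploaded per real-time minute, held constant over 15 years):

```python
# Recompute total uploaded video from the upload rate.
years = 15
real_minutes = years * 365 * 24 * 60       # real-time minutes in 15 years
uploaded_hours = real_minutes * 50         # youtube!hours uploaded in that time
uploaded_minutes = uploaded_hours * 60     # youtube!minutes uploaded

print(f"{uploaded_minutes:.1e} youtube!minutes")
# -> 2.4e+10 youtube!minutes, i.e. ~24B rather than 200B
```

(Running the same 150 words-per-minute and 1%-useful assumptions on ~24B minutes gives roughly 35B useful tokens rather than 300B.)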
In any case, yeah, this does not seem like a huge amount of data. But there’s enough order-of-magnitude fuzziness in the estimate that it does seem like it’s worth someone’s time to look into more seriously.
I agree that this points in the direction of video becoming increasingly important.
But why assume only 1% is useful? And more importantly, why use only the language data? Even if we don’t have the scaling laws, it seems pretty clear that there’s a ton of information in the non-language parts of videos that’d be useful to a general-purpose agent, almost certainly more than in the language parts. (Of course, it’ll take more computation to extract the same amount of useful information from video than from text.)