Interesting. My previous upper estimate of the human lifetime training-token equivalent was ~10B tokens (300 WPM ≈ 10 tokens/s × ~1e9 s), so 2B tokens makes sense if people are reading/listening only 20% of the time.
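A quick back-of-envelope check of that arithmetic, with every constant an explicit rough assumption on my part: ~300 WPM at ~1.3 tokens per word is ~6.5 tokens/s, rounded up to ~10 tokens/s as an upper bound, and ~16 waking hours/day over ~50 years is about 1e9 seconds.

```python
# Back-of-envelope lifetime token budget; all constants are rough assumptions.
words_per_minute = 300                        # brisk reading/listening speed
tokens_per_word = 1.3                         # typical BPE tokens per English word
tokens_per_second = words_per_minute / 60 * tokens_per_word   # ~6.5 tokens/s
upper_rate = 10                               # round up for an upper bound
waking_seconds = 16 * 3600 * 365 * 50         # ~1.05e9 s of waking life
upper_bound = upper_rate * waking_seconds     # ~1e10 = ~10B tokens
realistic = 0.2 * upper_bound                 # ~2B tokens at a 20% language duty cycle
print(f"~{upper_bound:.1e} token upper bound, ~{realistic:.1e} at 20% exposure")
```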
So the remaining question is why human performance on downstream tasks scales much more quickly as a function of tokens (or token perplexity), given that LLMs need 100B+, perhaps 1T+, tokens to approach human level. I'm guessing it's some mix of:
Active curriculum learning that focuses capacity on important knowledge quanta rather than trivia (a rough sketch of this in ML terms follows the list)
Embodiment & RL: humans use semantic knowledge to improve world models that are jointly trained by acting in the world, and perhaps have more complex decoding/mapping
Grokking vs shallow near memorization
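For the first item, one crude way to picture active curriculum learning in ML terms is loss-weighted sampling, which spends updates on examples the learner has not yet mastered rather than uniformly over already-memorized trivia. This is only an illustrative sketch under my own assumptions, not a claim about how humans (or any particular LLM) do it; `curriculum_batch`, `model_loss`, and `temperature` are hypothetical names.

```python
import numpy as np

def curriculum_batch(examples, model_loss, batch_size=32, temperature=1.0):
    """Sample a training batch with probability proportional to exp(loss / temperature).

    High-loss (not-yet-learned) examples are drawn more often, so capacity goes to
    the parts of the distribution that still matter rather than to memorized trivia.
    """
    losses = np.array([model_loss(x) for x in examples])
    weights = np.exp((losses - losses.max()) / temperature)  # subtract max for numerical stability
    probs = weights / weights.sum()
    idx = np.random.choice(len(examples), size=batch_size, replace=False, p=probs)
    return [examples[i] for i in idx]
```

A temperature near zero makes this close to greedy hardest-example selection, while a large temperature recovers uniform sampling.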