We’ve decoded much of the brain, but it’s still mysterious what the brain’s backprop-equivalent learning algorithm is, and how it seems to learn so quickly at batch size 1, sidestepping all these gradient noise considerations.
A human may read/hear/think on the order of a billion words per lifetime, possibly fewer. GPT-3 was trained on a few OOM more, and would still require many OOM more compute/data to hit human performance. DeepMind’s Atari agents need about 10^8 frames to match humans and are thus roughly ~3 OOM less data-efficient, ignoring human pretraining (true also for EZ; it just uses simulated frames).
Although if you factor in 10 years of human pretraining, that’s about 10^8 seconds, so perhaps a big chunk of the gap is just generic multimodal curriculum pretraining.
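For concreteness, a minimal back-of-envelope sketch of these ratios (the ~10^9 lifetime word count, the ~10^5 human Atari frame count, and the ~3×10^11-token GPT-3 corpus figure are rough assumptions, not precise measurements):

```python
# Rough order-of-magnitude check of the data-efficiency comparisons above.
# All inputs are assumptions/recollections, not exact figures.
import math

SECONDS_PER_YEAR = 3.15e7

human_words = 1e9          # assumed lifetime words read/heard/thought
gpt3_tokens = 3e11         # GPT-3's training corpus, roughly 300B tokens
agent_frames = 1e8         # frames a DeepMind Atari agent needs to match humans
human_frames = 1e5         # assumed human frames (a couple of hours of play)

human_pretraining_seconds = 10 * SECONDS_PER_YEAR  # 10 years of "pretraining"

print(f"GPT-3 vs human lifetime words: ~{math.log10(gpt3_tokens / human_words):.1f} OOM more text")
print(f"Atari agent vs human frames:   ~{math.log10(agent_frames / human_frames):.1f} OOM less data-efficient")
print(f"10 years of human pretraining: ~{human_pretraining_seconds:.0e} seconds")
```

The printed ratios come out to roughly 2.5 OOM, 3 OOM, and ~3×10^8 seconds respectively, consistent with the figures quoted above.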