You don’t need to guess; it’s clearly true. Even a 1 trillion parameter network where each parameter is represented with 64 bits can still only represent at most $2^{64{,}000{,}000{,}000{,}000}$ different functions, which is a tiny tiny fraction of the full space of $2^{2^{8{,}000{,}000}}$ possible functions. You’re already getting at least $2^{8{,}000{,}000} - 64{,}000{,}000{,}000{,}000$ of the bits just by choosing the network architecture.
(This does assume things like “the neural network can learn the correct function rather than a nearly-correct function”, but similarly the argument in the OP assumes “the toddler does learn the correct function rather than a nearly-correct function”.)
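To make the counting explicit, here is a quick sketch in Python. The 8,000,000-bit input size is an assumption read off from the exponents above, and everything is kept as exact integers, since the quantities involved are far beyond what floats can represent.

```python
# Counting argument: how much of the choice of function is made by the
# architecture versus by the learned parameters.

input_bits = 8_000_000        # assumed input size, read off from the exponents above
n_params = 10**12             # 1 trillion parameters
bits_per_param = 64

# An arbitrary binary function on input_bits-bit inputs needs one output bit
# per possible input, i.e. 2**input_bits bits to write down in full.
bits_to_specify_any_function = 2**input_bits          # = 2^8,000,000

# The trained parameters can carry at most this many bits of that choice.
bits_from_parameters = n_params * bits_per_param      # = 64,000,000,000,000

# The rest is fixed the moment you pick the architecture.
bits_from_architecture = bits_to_specify_any_function - bits_from_parameters

print(f"bits from parameters:   {bits_from_parameters:,}")
print(f"bits from architecture: roughly 2**{bits_from_architecture.bit_length()}")
# The parameters account for a vanishingly small share of the full specification:
print(f"parameter share of the bits: about 2**-{input_bits - bits_from_parameters.bit_length()}")
```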
See also Superexponential Concept Space, and Simple Words, from the Sequences:

> By the time you’re talking about data with forty binary attributes, the number of possible examples is past a trillion—but the number of possible concepts is past two-to-the-trillionth-power. To narrow down that superexponential concept space, you’d have to see over a trillion examples before you could say what was In, and what was Out. You’d have to see every possible example, in fact.
>
> [...]
>
> From this perspective, learning doesn’t just rely on inductive bias, it is nearly all inductive bias—when you compare the number of concepts ruled out a priori, to those ruled out by mere evidence.
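For concreteness, the arithmetic behind the quoted passage can be written out as a small Python check; the only figures used are the ones named in the quote (forty binary attributes, one In/Out label per example).

```python
# The counting in the quoted passage, spelled out for 40 binary attributes.

attributes = 40

# Every assignment of the 40 attributes is a distinct possible example.
possible_examples = 2**attributes          # 1,099,511,627,776: "past a trillion"

# A concept is an arbitrary In/Out labelling of all possible examples, so there
# are 2**(2**40) concepts, and pinning one down exactly takes 2**40 bits.
bits_to_pin_down_a_concept = possible_examples

# Each labelled example supplies at most one bit (it rules out at most half of
# the remaining concepts), so identifying an arbitrary concept takes at least
# 2**40 labelled examples: every possible example.
examples_needed = bits_to_pin_down_a_concept

print(f"possible examples:                           {possible_examples:,}")
print(f"labelled examples needed, in the worst case: {examples_needed:,}")
```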