The scaling laws are extremely well established in DL and there are strong theoretical reasons (and increasingly experimental neurosci evidence) that they are universal to all NNs, and we have good theoretical models of why they arise.
I’m not aware of these—do you have any references?
Both brains and DL systems have fairly simple architectural priors in comparison to the emergent learned complexity
True but misleading? Isn’t the brain’s “architectural prior” a heckuva lot more complex than the things used in DL?
Brains are very slow, so they have limited combinatorial search, and our search/planning is just short-term learning (short/medium-term plasticity). Again, it's nearly all learning (synaptic updates).
Sure. The big crux here is whether plasticity of stuff which is normally “locked down” in adulthood is needed to significantly increase “fluid intelligence” (by which I mean, something like, whatever allows people to invent useful new concepts and find ingenious applications of existing concepts). I’m not convinced these DL analogies are useful—what properties do brains and deepnets share that renders the analogies useful here? DL is a pretty specific thing, so by default I’d strongly expect brains to differ in important ways. E.g. what if the structures whose shapes determine the strength of fluid intelligence aren’t actually “locked down”, but reach a genetically-influenced equilibrium by adulthood, and changing the genes changes the equilibrium? E.g. what if working memory capacity is limited by the noisiness of neural transmission, and we can reduce the noisiness through gene edits?
I find the standard arguments for doom implausible—they rely on many assumptions contradicted by deep knowledge of computational neuroscience and DL.
FOOM isn’t necessary for doom—the convergent endpoint is that you have dangerously capable minds around: minds which can think much faster and figure out things we can’t. FOOM is one way to get there.
True but misleading? Isn’t the brain’s “architectural prior” a heckuva lot more complex than the things used in DL?
The full specification of the DL system includes the microcode, OS, etc. Likewise, much of the brain's complexity is in the smaller 'oldbrain' structures that are the equivalent of a base robot OS. The architectural prior I speak of is the complexity on top of that, which separates us from some ancient earlier vertebrate brain. But again see the brain as a ULM post, which covers the extensive evidence for emergent learned complexity from simple architecture/algorithms (now the dominant hypothesis in neuroscience).
I’m not convinced these DL analogies are useful—what properties do brains and deepnets share that renders the analogies useful here?
Most everything above the hardware substrate—but I've already provided links to sections of my articles addressing the convergence of DL and neurosci, with many dozens of references. So it'd probably be better to focus on exactly which specific key analogies/properties you believe diverge.
DL is a pretty specific thing
DL is extremely general—it's just efficient approximate Bayesian inference over circuit spaces. It doesn't imply any specific architecture, and doesn't even strongly imply any specific approximate inference/learning algorithm (first-order and approximate second-order methods are both common).
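As a toy illustration of that last point, here is a minimal NumPy sketch, assuming a simple ill-conditioned quadratic standing in for a loss surface (my own example, not anything from the linked articles): a first-order SGD step and a diagonally preconditioned approximate second-order step descend the same objective and differ only in how they precondition the gradient.

```python
# Toy sketch: the same quadratic objective minimized with a first-order
# (SGD-style) update and an approximate second-order (diagonally
# preconditioned, Newton-like) update.
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([100.0, 1.0])          # badly conditioned quadratic: f(w) = 0.5 * w^T A w
w_sgd = rng.normal(size=2)
w_newton = w_sgd.copy()

lr = 0.009                          # must be < 2/100 for the SGD step to be stable here
for _ in range(200):
    g = A @ w_sgd                   # gradient of the quadratic
    w_sgd -= lr * g                 # first-order step

    g2 = A @ w_newton
    precond = 1.0 / np.diag(A)      # cheap diagonal curvature estimate
    w_newton -= precond * g2        # approximate second-order step

print("SGD:        ", w_sgd)        # still creeping along the flat direction
print("diag-Newton:", w_newton)     # essentially at the optimum
```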
E.g. what if working memory capacity is limited by the noisiness of neural transmission, and we can reduce the noisiness through gene edits?
Training to increase working memory capacity has near-zero effect on IQ or downstream intellectual capabilities—see Gwern's reviews and experiments. Working memory capacity is important in both brains and ANNs (transformers), but it comes from large fast-weight synaptic capacity, not simple hacks.
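A minimal sketch of the fast-weight idea, assuming a linear-attention-style outer-product memory (my own illustration): items are written into a weight matrix as key/value outer products and read back by query, so reliable capacity grows with the size of that matrix rather than with any single tunable knob.

```python
# Minimal fast-weight associative memory: Hebbian outer-product writes,
# dot-product reads (the mechanism behind linear-attention-style memories).
import numpy as np

rng = np.random.default_rng(1)
d = 256                              # key/value dimension ("synaptic capacity")
W = np.zeros((d, d))                 # the fast weights: a short-term memory store

def write(W, key, value):
    """Store one item as a rank-1 outer-product update (a fast synaptic change)."""
    return W + np.outer(value, key)

def read(W, key):
    """Retrieve the value associated with (an approximate) key."""
    return W @ key

# Store a handful of random key/value pairs, then recall one of them.
keys = [rng.normal(size=d) / np.sqrt(d) for _ in range(5)]
vals = [rng.normal(size=d) for _ in range(5)]
for k, v in zip(keys, vals):
    W = write(W, k, v)

recalled = read(W, keys[2])
cos = recalled @ vals[2] / (np.linalg.norm(recalled) * np.linalg.norm(vals[2]))
print(f"cosine similarity of recall vs. stored value: {cos:.2f}")
```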
Noise is important for sampling—adequate noise is a feature, not a bug.
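A minimal sketch of why, assuming a 1-D double-well density and plain Langevin dynamics (again my own toy example): with the injected noise the chain samples both modes of the distribution, while the very same update rule with the noise removed just collapses onto whichever mode is nearest.

```python
# Langevin dynamics: injected Gaussian noise turns a mode-seeking gradient
# update into a sampler over the whole distribution.
import numpy as np

rng = np.random.default_rng(2)

def grad_log_p(x):
    """Gradient of log-density of an assumed double-well: p(x) proportional to exp(-(x^2 - 1)^2)."""
    return -4.0 * x * (x * x - 1.0)

def run(noise_scale, steps=20000, eps=0.01):
    x, xs = 0.1, []
    for _ in range(steps):
        x += 0.5 * eps * grad_log_p(x) + np.sqrt(eps) * noise_scale * rng.normal()
        xs.append(x)
    return np.array(xs)

with_noise = run(noise_scale=1.0)    # proper Langevin sampling
noise_free = run(noise_scale=0.0)    # pure gradient ascent on log p

print("fraction of time in each well (with noise):",
      np.mean(with_noise < 0).round(2), np.mean(with_noise > 0).round(2))
print("noise-free chain settles at x =", noise_free[-1].round(2))
```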
I’m not aware of these—do you have any references?
[Scaling law theories]
Sure, here are a few: the quantization model, scaling laws from the data manifold, and a statistical model.
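For a flavor of the kind of mechanism these models propose, here is a toy numerical sketch, assuming Zipf-distributed "quanta" of skill and a model that simply learns the N most frequent ones (my own simplification, not the papers' actual derivations): the leftover loss already follows a power law in N.

```python
# Toy sketch of one generic mechanism behind neural scaling laws:
# Zipfian quanta frequencies + finite capacity => power-law loss in N.
import numpy as np

zipf_exp = 1.5                                   # assumed Zipf exponent for quanta use frequency
k = np.arange(1, 10**6 + 1, dtype=float)
freq = k ** (-zipf_exp)
freq /= freq.sum()                               # p_k: how often quantum k is needed

def loss(N):
    """Error incurred if the model has learned only the N most frequent quanta."""
    return freq[N:].sum()

Ns = np.array([10**2, 10**3, 10**4])
losses = np.array([loss(N) for N in Ns])

# Fit the empirical exponent: loss(N) ~ N^(-alpha), with alpha = zipf_exp - 1 expected.
alpha = -np.polyfit(np.log(Ns), np.log(losses), 1)[0]
print("fitted scaling exponent:", alpha.round(2), "(expected ~", zipf_exp - 1, ")")
```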