ANNs and BNNs operate on the same core principles; the scaling laws apply to both, and IQ in either is mostly a function of net effective training compute and data quality.
How do you know this?
Genes determine a brain’s architectural prior just as a small amount of Python code determines an ANN’s architectural prior, but the capabilities come only from scaling with compute and data (quantity and quality).
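To make “a small amount of Python code” concrete, here is a toy sketch (layer sizes and all details are made up for illustration) of how few lines it takes to specify an architectural prior, with everything else left to be learned:

```python
import numpy as np

# Toy "architectural prior": these few lines fix depth, width, and the
# nonlinearity. All of the actual capability lives in the learned weights.
def init_mlp(sizes, seed=0):
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, sizes[i] ** -0.5, (sizes[i], sizes[i + 1])),
             np.zeros(sizes[i + 1]))
            for i in range(len(sizes) - 1)]

def forward(params, x):
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)  # ReLU hidden layers
    W, b = params[-1]
    return x @ W + b                    # linear readout

params = init_mlp([784, 512, 512, 10])  # the entire "prior" in one line
print(forward(params, np.zeros((1, 784))).shape)  # (1, 10)
```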
In comparing human brains to DL, training seems more analogous to natural selection than to brain development. Much simpler “architectural prior”, vastly more compute and data.
So you absolutely cannot take datasets of gene-IQ correlations and assume those correlations would somehow transfer to gene interventions on adults.
We’re really uncertain about how much would transfer! It would probably affect some aspects of intelligence more than others, and I’m afraid it might just not work at all if g is determined by the shape of structures that are ~fixed in adults (e.g. long range white matter connectome). But it’s plausible to me that the more plastic local structures and the properties of individual neurons matter a lot for at least some aspects of intelligence (e.g. see this).
So to the extent this could work at all, it is mostly limited to interventions on children and younger adults who still have significant learning rate reserves.
There’s a lot more to intelligence than learning. Combinatorial search, unrolling the consequences of your beliefs, noticing things, forming new abstractions. One might consider forming new abstractions as an important part of learning, which it is, but it seems possible to come up with new abstractions ‘on the spot’ in a way that doesn’t obviously depend on plasticity that much; plasticity would more determine whether the new ideas ‘stick’. I’m bottlenecked by the ability to find new abstractions that usefully simplify reality, not having them stick when I find them.
But it ultimately doesn’t matter, because the brain just learns too slowly. We will soon be past the point at which human learning matters much.
My model is there’s this thing lurking in the distance, I’m not sure how far out: dangerously capable AI (call it DCAI). If our current civilization manages to cough up one of those, we’re all dead, essentially by definition (if DCAI doesn’t kill everyone, it’s because technical alignment was solved, which our current civilization looks very unlikely to accomplish). We look to be on a trajectory to cough one of those up, but it isn’t at all obvious to me that it’s just around the corner, so stuff like this seems worth trying, since humans qualitatively smarter than any current humans might have a shot at thinking of a way out that we didn’t think of (or just having the mental horsepower to quickly get something we have thought of working, e.g. mind uploading).
ANNs and BNNs operate on the same core principles; the scaling laws apply to both, and IQ in either is mostly a function of net effective training compute and data quality.
How do you know this?
From the study of DL and neuroscience, of course. I’ve also written on this for LW in some reasonably well-known posts: starting with The Brain as a Universal Learning Machine, and continuing in Brain Efficiency and AI Timelines (for the source of human intelligence, see specifically the Cultural Scaling Criticality section), or the DL section of simboxes. Or you could see Steven Byrnes’s extensive LW writings on the brain—we are mostly in agreement on the current consensus from computational/systems neuroscience.
The scaling laws are extremely well established in DL and there are strong theoretical reasons (and increasingly experimental neurosci evidence) that they are universal to all NNs, and we have good theoretical models of why they arise. Strong performance arises from search (Bayesian inference) over a large circuit space. Strong general performance is strong performance on many, many diverse subtasks, which require many, many specific circuits built on top of compressed/shared base circuits down a hierarchy. The strongest quantitative predictor of performance is the volume of search space explored, which is the product C * T (capacity times data/time). Data quality matters in the sense that the quantitative relationship between search volume and predictive loss only holds for tasks similar enough to the training data distribution.
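To make the functional form concrete, here is a toy sketch (the exponents and constants are made up for illustration, not fit to anything): loss falling as a power law in the explored search volume C * T, down to an irreducible floor.

```python
import numpy as np

# Toy scaling-law sketch: loss modeled as a power law in the searched
# volume C * T, plus an irreducible floor. All constants are illustrative.
def predicted_loss(capacity, data, a=10.0, b=0.05, floor=1.5):
    search_volume = capacity * data               # C * T
    return a * search_volume ** (-b) + floor      # power-law decay to a floor

for c, t in [(1e6, 1e9), (1e8, 1e9), (1e8, 1e11)]:
    print(f"C={c:.0e}, T={t:.0e} -> loss ~ {predicted_loss(c, t):.3f}")
```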
In comparing human brains to DL, training seems more analogous to natural selection than to brain development. Much simpler “architectural prior”, vastly more compute and data.
No—biological evolution via natural selection is very similar to technological evolution via engineering. Both brains and DL systems have fairly simple architectural priors in comparison to the emergent learned complexity (remember, whenever I use the term learning, I use it in a technical sense, not a colloquial sense). See my first early ULM post for a review of the extensive evidence (greatly substantiated now by my scaling-hypothesis predictions coming true with the scaling of transformers, which are similar to the archs I discussed in that post).
So to the extent this could work at all, it is mostly limited to interventions on children and younger adults who still have significant learning rate reserves.
There’s a lot more to intelligence than learning.
Whenever I use the word learning, without further clarification, I mean learning as in Bayesian learning or deep learning, not in the colloquial sense. My definition of learning is all-encompassing: it covers all significant changes to synapses/weights.
Combinatorial search, unrolling the consequences of your beliefs, noticing things, forming new abstractions.
Brains are very slow so have limited combinatorial search, and our search/planning is just short term learning (short/medium term plasticity). Again it’s nearly all learning (synaptic updates).
if DCAI doesn’t kill everyone, it’s because technical alignment was solved, which our current civilization looks very unlikely to accomplish)
I find the standard arguments for doom implausible—they rely on many assumptions contradicted by deep knowledge of computational neuroscience and DL.
I was at the WBE2 workshop with Davidad but haven’t yet had time to write about progress (or lack thereof); I think we probably mostly agree that the type of uploading moonshot he discusses there is enormously expensive (not just in initial R&D, but also in recurring per-scan costs). I am actually more optimistic that more purely DL-based approaches will scale to much lower cost, but “much lower cost” is still on the order of GPT-4 training cost just to process the scan data through a simple vision ANN—for a single upload.
The scaling laws are extremely well established in DL and there are strong theoretical reasons (and increasingly experimental neurosci evidence) that they are universal to all NNs, and we have good theoretical models of why they arise.
I’m not aware of these—do you have any references?
Both brains and DL systems have fairly simple architectural priors in comparison to the emergent learned complexity
True but misleading? Isn’t the brain’s “architectural prior” a heckuva lot more complex than the things used in DL?
Brains are very slow so have limited combinatorial search, and our search/planning is just short term learning (short/medium term plasticity). Again it’s nearly all learning (synaptic updates).
Sure. The big crux here is whether plasticity of stuff which is normally “locked down” in adulthood is needed to significantly increase “fluid intelligence” (by which I mean, something like, whatever allows people to invent useful new concepts and find ingenious applications of existing concepts). I’m not convinced these DL analogies are useful—what properties do brains and deepnets share that renders the analogies useful here? DL is a pretty specific thing, so by default I’d strongly expect brains to differ in important ways. E.g. what if the structures whose shapes determine the strength of fluid intelligence aren’t actually “locked down”, but reach a genetically-influenced equilibrium by adulthood, and changing the genes changes the equilibrium? E.g. what if working memory capacity is limited by the noisiness of neural transmission, and we can reduce the noisiness through gene edits?
I find the standard arguments for doom implausible—they rely on many assumptions contradicted by deep knowledge of computational neuroscience and DL
FOOM isn’t necessary for doom—the convergent endpoint is that you have dangerously capable minds around: minds which can think much faster and figure out things we can’t. FOOM is one way to get there.
True but misleading? Isn’t the brain’s “architectural prior” a heckuva lot more complex than the things used in DL?
The full specification of the DL system includes the microcode, OS, etc. Likewise much of the brain’s complexity is in the smaller ‘oldbrain’ structures that are the equivalent of a base robot OS. The architectural prior I speak of is the complexity on top of that, which separates us from some ancient earlier vertebrate brain. But again see the brain as a ULM post, which covers the extensive evidence for emergent learned complexity from simple architecture/algorithms (now the dominant hypothesis in neuroscience).
I’m not convinced these DL analogies are useful—what properties do brains and deepnets share that renders the analogies useful here?
Most everything above the hardware substrate—but I’ve already provided links to sections of my articles addressing the convergence of DL and neurosci, with many dozens of references. So it’d probably be better to focus on exactly which specific key analogies/properties you believe diverge.
DL is a pretty specific thing
DL is extremely general—it’s just efficient approximate Bayesian inference over circuit spaces. It doesn’t imply any specific architecture, and it doesn’t even strongly imply any specific approximate inference/learning algorithm (1st-order and approximate 2nd-order methods are both common).
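As a toy illustration of that last point (numpy only, made-up data): the same objective can be minimized with a first-order update or with an approximate second-order one, and both count as learning in the sense above; nothing in the framework commits you to a particular algorithm.

```python
import numpy as np

# Same least-squares objective, two update rules: plain gradient descent
# (first order) and a Newton step that preconditions with the Hessian
# (second order). Data and sizes are made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=100)

def grad(w):
    return X.T @ (X @ w - y) / len(y)

w1 = np.zeros(5)                       # first order: many small steps
for _ in range(500):
    w1 -= 0.1 * grad(w1)

H = X.T @ X / len(y)                   # second order: one preconditioned step
w2 = np.zeros(5) - np.linalg.solve(H, grad(np.zeros(5)))

print(np.allclose(w1, w_true, atol=1e-2), np.allclose(w2, w_true, atol=1e-2))
```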
E.g. what if working memory capacity is limited by the noisiness of neural transmission, and we can reduce the noisiness through gene edits?
Training to increase working memory capacity has near zero effect on IQ or downstream intellectual capabilities—see gwern’s reviews and experiments. Working memory capacity is important in both brains and ANNs (transformers), but it comes from large fast weight synaptic capacity, not simple hacks.
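Here is a minimal sketch of what I mean by fast-weight capacity (a toy associative memory of my own construction, not any particular model): items are bound into a rapidly updated weight matrix and retrieved by association, rather than held in a small fixed buffer.

```python
import numpy as np

# Toy fast-weight working memory: store key/value pairs as rank-one
# outer-product updates, retrieve by matrix-vector product. Sizes are
# illustrative only.
rng = np.random.default_rng(1)
d = 256
keys = rng.normal(size=(10, d)) / np.sqrt(d)   # roughly unit-norm keys
values = rng.normal(size=(10, d))

W_fast = np.zeros((d, d))
for k, v in zip(keys, values):
    W_fast += np.outer(v, k)                   # Hebbian-style rapid update

recalled = W_fast @ keys[3]                    # cue with key 3
print(np.corrcoef(recalled, values[3])[0, 1])  # near 1 while load is low
```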
Noise is important for sampling—adequate noise is a feature, not a bug.
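A toy example of the sampling point (illustrative only): injecting the right amount of Gaussian noise into a gradient step on an energy function turns descent into sampling from the corresponding distribution (unadjusted Langevin dynamics).

```python
import numpy as np

# Unadjusted Langevin dynamics on the energy 0.5 * x**2, whose corresponding
# distribution is the standard normal. Without the noise term this is just
# gradient descent to x = 0; with it, the chain samples the distribution.
rng = np.random.default_rng(2)
energy_grad = lambda x: x
x, eps, samples = 0.0, 0.01, []
for _ in range(50_000):
    x += -eps * energy_grad(x) + np.sqrt(2 * eps) * rng.normal()
    samples.append(x)
print(np.mean(samples), np.var(samples))  # roughly 0 and 1
```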
I’m not aware of these—do you have any references?
Sure, here are a few: the quantization model, scaling laws from the data manifold, and a statistical model.