Scaling laws vs individual differences
Crossposted from my personal blog
Epistemic Status: This is a quick post on something I have been confused about for a while. If an answer to this is known, please reach out and let me know!
In ML we find that the performance of models tends towards some kind of power-law relationship between the loss and the amount of data in the dataset or the number of parameters of the model. What this means is, in effect, that to decrease the loss by a constant factor (that is, to get a constant decrease in the log of the loss), we need to increase either the data or the model or some combination of both by a constant factor. Power law scaling appears to occur for most models studied, including in extremely simple toy examples, and hence appears to be some kind of fundamental property of how ‘intelligence’ scales, although for reasons that are at the moment quite unclear (at least to me; if you know, please reach out and tell me!).
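To make the ‘constant factor’ statement concrete, here is the arithmetic for a pure power law in parameter count. The symbols a, α and k are illustrative assumptions and are not taken from any particular scaling-law fit; real fits usually also include an irreducible loss term, which this sketch leaves out.

```latex
L(N) = a\,N^{-\alpha}
\quad\Longrightarrow\quad
\frac{L(kN)}{L(N)} = k^{-\alpha},
\qquad
\log L(kN) - \log L(N) = -\alpha \log k .
```

So multiplying N by k multiplies the loss by the same factor k^{-α} wherever you start. With an assumed exponent of α = 0.1, doubling N gives 2^{-0.1} ≈ 0.93, i.e. roughly a 7% reduction in loss.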
Crucially, power law scaling is actually pretty bad and means that performance grows relatively slowly with scale. A model with twice as many parameters or twice as much data does not perform twice as well. These diminishing returns to intelligence are of immense importance for forecasting AI risks since whether FOOM is possible or not depends heavily on the returns to increasing intelligence in the range around the human level.
In biology, we also see power law scaling between species. For instance, there is a clear power law scaling curve relating the brain size of various species to roughly how ‘intelligent’ we think they are. Indeed, there are general cross-species scaling laws for intelligence and neuron count and density, with primates being on a superior scaling law to most other animals. These scaling laws are again slow. It takes a very significant number of additional neurons, or a very significant increase in brain size, to really move the needle on observed intelligence. We also see that brain size, unsurprisingly, is a very effective measure of the ‘parameter count’, at least within species which share the same neural density scaling laws.
However, on the inside view, we know there are significant differences in intellectual performance between humans. Differences in performance across tasks are also strongly correlated with each other, such that if someone is bad or good at one task, it is pretty likely that they will also be bad or good at another. If you analyze many such intellectual tasks and perform factor analysis, you tend to get a single dominant factor, which is called the general intelligence factor g. Numerous studies have demonstrated that IQ is a highly reliable measure, is strongly correlated with performance measures such as occupational success, and that a substantial component of IQ is genetic. However, genetic variation of humans on key parameters such as brain size or neuron count, as well as data input, while present, is very small compared to the logarithmic scaling law factors. Natural human brain size variation does not range over 2x brain volume, let alone a 10x or multiple order of magnitude difference[1]. The scaling laws view would therefore predict that individual differences in IQ between humans are very small, and essentially logarithmic on the loss.
However, at least from our vantage point, this is not what we observe. Individual differences between humans (and also other animals) appear to very strongly impact performance on an extremely wide range of ‘downstream tasks’. People with IQs at +3 standard deviations, despite their rarity in the population, are responsible for the vast majority of intellectual advancement, while people with IQs at −3 standard deviations are extremely challenged by even simple intellectual tasks. This seems like a very large variation in objective performance, which is not predicted by the scaling laws view.
Another thing that is highly confusing is the occurrence of rare cognitive abilities of a specialized type, rather than a pure g-like general ability. The existence of savantism is the case in point here. Savants can have highly specialized abilities such as a true photographic (eidetic) memory, the ability to perform extremely complex calculations in their heads (human calculators), and many others. These cases show that, despite having a very ANN-like parallel architecture, to some degree humans can end up specialized in more standard serial computer tasks, although this usually (but not always) comes at the cost of general IQ and functioning. It also shows how much variability in performance there is in humans based on what must be relatively small differences in brain architecture, something which is not predicted by a general scaling laws analysis.
Basically, the crux is this: we have good reason to suspect that biological intelligence, and hence human intelligence, roughly follows similar scaling law patterns to what we observe in machine learning systems. At the scale at which the human brain operates, the scaling laws would predict that very large absolute changes in parameter count would be necessary for a significant change in performance. However, when we look at natural human variation in IQ, we see apparently very large changes in intellectual ability without correspondingly large changes in brain size.
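To put rough numbers on this crux, here is a minimal sketch that inverts a pure power law L(N) = a * N**(-alpha) to ask how much scale-up a given loss reduction needs. The exponent `ALPHA = 0.08` and both function names are assumptions made up for illustration; nothing here is a measured value for brains or for any particular model family.

```python
def param_multiplier_for_loss_ratio(loss_ratio: float, alpha: float) -> float:
    """Factor by which N must grow so that a pure power law
    L(N) = a * N**(-alpha) shrinks the loss to loss_ratio * L(N)."""
    if not 0.0 < loss_ratio <= 1.0:
        raise ValueError("loss_ratio must be in (0, 1]")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return loss_ratio ** (-1.0 / alpha)


def loss_ratio_for_param_multiplier(multiplier: float, alpha: float) -> float:
    """Loss ratio L(k * N) / L(N) obtained by multiplying N by `multiplier`."""
    return multiplier ** (-alpha)


ALPHA = 0.08  # assumed exponent, for illustration only

# Loss change across a 2x range in parameter count (human brain volume
# varies by less than this).
print(loss_ratio_for_param_multiplier(2.0, ALPHA))

# Scale-up needed for a 10% and for a 50% reduction in loss.
print(param_multiplier_for_loss_ratio(0.9, ALPHA))
print(param_multiplier_for_loss_ratio(0.5, ALPHA))
```

Under these assumed numbers, a 2x difference in parameter count moves the loss by only about 5%, while halving the loss needs a scale-up of several thousand fold. That is the sense in which the scaling laws view predicts small individual differences.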
How can we resolve this paradox? There are several options, but I am not confident in any of them.
1.) Perhaps there is no contradiction. The scaling laws consider the impact of parameters on the log-loss; however, we can never observe the log-loss directly, only the performance of humans on downstream tasks. It is possible that humans are at a point on the scaling law where the density of tasks that can be solved per unit of log-loss is extremely high, such that even tiny variations in log-loss result in large apparent differences in performance. This would imply that the geometry of the log-loss surface is essentially conformal, such that as we approach the irreducible log-loss of the dataset, the density of tasks that can be solved increases rapidly, so that we asymptotically approach a limit where at the irreducible loss we can solve all tasks. Something like this must be true, insofar as at large losses, decreasing the loss is easy, but it then becomes increasingly difficult as further decreases in the loss require the learning of ever more subtle patterns. If this is true, then the scaling laws would greatly underestimate the effects of scale, since the exponential increase in the number of tasks solvable given a unit of log-loss would precisely counteract the logarithmic increase in parameter count, meaning that there would be a linear relationship between the parameter count and the number of tasks that can be solved.
2.) Perhaps our inside view is wrong and human IQ differences are objectively tiny, including on downstream performance tasks. We just observe them as large due to our limited perspective on the full intelligence distribution, and also due to the selection effects of choosing to measure IQ based on tasks that humans vary considerably over, as opposed to the extremely large number of tasks that almost all humans of whatever IQ could solve, or alternatively the tasks that no human could ever solve, no matter their IQ. My feeling is that this is probably some of it, but cannot be the whole story, since the human IQ-performance range seems to cover a wide range of objectively useful tasks like ‘be able to make technological advancements or not’.
3.) Individual differences in IQ are due to different causes than the scaling laws of parameter count and dataset size, and hence have very large effects on log-loss. This is in line with the mutational load view, which argues that IQ differences are caused by the number of deleterious mutations, and that the normal distribution arises from small additive genetic variance caused by small mutations. Under this view, humans would have an ‘ideal scaling law IQ’, corresponding to 0 deleterious mutations, that is significantly higher than any actually observed human IQ, and our actually observed performance is substantially below this. This would have implications for bio-anchors style AI forecasting, since it would mean that the ‘human-level’ estimates are about an intelligence level potentially much greater than actually observed humans, which would shorten timelines. The converse would also suggest that improvements to IQ based on, say, iterated embryo selection would be expected to asymptotically vanish as we approach our ‘scaling-law IQ’, so long as we do not also massively increase brain size.
This does leave unanswered the question of what exactly these deleterious mutations actually do. They cannot be variations in brain size, and dataset input is not particularly affectable by the genome. My hunch is that it is something to do with noise levels or connectivity patterns. It could just be something simple, such as that mutations tend to de facto increase the level of noise in neural computation and hence decrease the SNR on each step. This would make long range communication, as well as multi-step sequential thought, vastly more difficult to do correctly, leading to much poorer performance on tasks which require consistent coherent thought over multiple steps, while impairing much less those tasks which can be performed by a standard highly parallel feedforward sweep, like core object recognition, which, as far as I know, is much less IQ loaded than, say, solving logic puzzles. The SNR view would also predict the common finding that IQ is correlated with reaction time to visual and auditory stimuli. Supposing that, to become confident enough to make a judgement, the brain uses something like the drift-diffusion model (which is Bayes-optimal for 2AFC tasks), more noise would require a longer time to integrate neural signals to reach the required confidence, leading to a slower reaction time. Finally, another possibility here is that genetic differences could affect neuron density in the brain, potentially enabling a much greater scaling of parameter count than would be observable from external brain size [2].
4.) Perhaps individual differences among humans are actually architectural differences which affect scaling law coefficients instead of parameter count. I.e. it could be that people with low IQs are de facto implementing a much worse general purpose brain architecture, with worse scaling coefficients in general, than those with higher IQs. This could result in almost arbitrary differences in log-loss at the same parameter count. My feeling, however, is that this is unlikely, given that such architectural differences would seem large enough to be detectable at the macro-scale, such as in fMRI studies, and I do not know of any results showing this in neuroscience. It also contradicts the additive genetic variance story, and hence the observed normal distribution of IQ, since architectural variation should be expected to be roughly discrete rather than a continuous spectrum. However, perhaps architectural differences are implicitly encoded not in macro-scale structure but in fine-grained connectivity differences, which would not be particularly visible in neuroscience. This is consistent with theorizing that autism may be due to altered connectivity patterns. There could then be better or worse scaling coefficients for different connectivity patterns, or perhaps the connectivity pattern encodes inductive biases which are better or worse for specific situations.
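Option 1 above can be made concrete with a toy calculation. The sketch below uses a power law with an irreducible floor and a hypothetical task-count function that blows up as the loss approaches that floor. The constants `IRREDUCIBLE`, `A` and `ALPHA`, and the form of `tasks_solved`, are all assumptions: the task curve is picked precisely so that it undoes the power law, so this shows what option 1 would require rather than any evidence that it holds.

```python
IRREDUCIBLE = 1.7  # assumed irreducible loss of the dataset
A = 10.0           # assumed power-law prefactor
ALPHA = 0.08       # assumed power-law exponent


def loss(n_params: float) -> float:
    """Power law with a floor: L(N) = L_inf + A * N**(-alpha)."""
    return IRREDUCIBLE + A * n_params ** (-ALPHA)


def tasks_solved(current_loss: float) -> float:
    """Hypothetical task count that diverges as the loss nears the floor.

    T(L) = (A / (L - L_inf)) ** (1 / alpha). This form is an assumption,
    chosen so that it exactly cancels the power law above.
    """
    return (A / (current_loss - IRREDUCIBLE)) ** (1.0 / ALPHA)


n = 1e11  # illustrative parameter count
for k in (1, 2, 4):
    current = loss(k * n)
    reducible = current - IRREDUCIBLE
    print(f"N x{k}: reducible loss={reducible:.4f}  tasks={tasks_solved(current):.3e}")
```

With these choices, doubling N shrinks the reducible loss by only about 5% but doubles the number of tasks solved, so task performance is linear in parameter count even though the loss barely moves.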
My personal feeling is that the answer is probably some combination of 1 and 3, along with a small amount of selection effects due to 2 making us believe that individual differences are larger than they are ‘objectively’. Architectural differences are also possible, although I remain confused about how these can be implemented by genetic mutations which are small enough to give rise to an additive effect and hence a normal distribution of outcomes. Of course, it could be that the normal distribution itself is some kind of artefact of our testing methodology. Also, architectural changes seem a more plausible cause of unusual cognitive abilities such as savantism and of certain conditions such as autism (where autism and savantism almost always cluster together, i.e. almost everyone with savant skills is autistic but not vice versa).
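The reaction time prediction of the noise version of option 3 can also be sketched numerically. The code below uses the standard closed-form accuracy and mean decision time for a symmetric drift-diffusion model with an unbiased starting point, and asks how long a decision takes if accuracy is held fixed while noise increases. The values of `DRIFT` and `TARGET_ACC` and the function names are illustrative assumptions, and non-decision time (sensory and motor delays) is ignored.

```python
import math


def threshold_for_accuracy(target_acc: float, drift: float, noise_sd: float) -> float:
    """Boundary height a at which a symmetric DDM reaches target_acc.

    Inverts P(correct) = 1 / (1 + exp(-2 * a * drift / noise_sd**2)).
    """
    if not 0.5 < target_acc < 1.0:
        raise ValueError("target_acc must be in (0.5, 1)")
    log_odds = math.log(target_acc / (1.0 - target_acc))
    return 0.5 * log_odds * noise_sd ** 2 / drift


def mean_decision_time(threshold: float, drift: float, noise_sd: float) -> float:
    """Mean first-passage time to either boundary at +/- threshold."""
    return (threshold / drift) * math.tanh(threshold * drift / noise_sd ** 2)


DRIFT = 1.0        # assumed evidence accumulation rate
TARGET_ACC = 0.95  # assumed accuracy required before committing

for noise_sd in (1.0, 1.5, 2.0):
    a = threshold_for_accuracy(TARGET_ACC, DRIFT, noise_sd)
    t = mean_decision_time(a, DRIFT, noise_sd)
    print(f"noise_sd={noise_sd}: threshold={a:.3f}  mean decision time={t:.3f}")
```

Under these assumptions, the mean decision time at fixed accuracy grows with the square of the noise standard deviation, so a modest increase in neural noise shows up as a noticeably slower reaction time.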
[1] Interestingly, there is a consistent finding of a modest positive correlation between brain size and IQ. This suggests that at least some of the variance in IQ is caused by scaling law like effects, but by no means the majority. This is consistent with the scaling laws, which predict that large changes in brain size would be necessary to obtain significant changes in cognitive performance.
[2] It is argued here that a difference in neuronal density scaling is what differentiates primates from other mammals and is thus why large animals such as elephants and whales are not more intelligent than humans despite their larger brains. Small mutations which affect neuronal density could thus lead to different humans having significantly different neuron counts (and hence scaling law IQs) despite having approximately the same gross brain volume.