Scaling laws vs individual differences
Crossposted from my personal blog
Epistemic Status: This is a quick post on something I have been confused about for a while. If an answer to this is known, please reach out and let me know!
In ML we find that the performance of models tends towards some kind of power-law relationship between the loss and the amount of data in the dataset or the number of parameters of the model. What this means is, in effect, that to cut the loss by a constant fraction (equivalently, to get a constant decrease in the log-loss), we need to increase either the data or the model or some combination of both by a constant factor. Power law scaling appears to occur for most models studied, including in extremely simple toy examples, and hence appears to be some kind of fundamental property of how ‘intelligence’ scales, although for reasons that are at the moment quite unclear (at least to me; if you know, please reach out and tell me!).
Crucially, power law scaling is actually pretty bad and means that performance grows relatively slowly with scale. A model with twice as many parameters or twice as much data does not perform twice as well. These diminishing returns to intelligence are of immense importance for forecasting AI risks since whether FOOM is possible or not depends heavily on the returns to increasing intelligence in the range around the human level.
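To make the 'twice the parameters is not twice the performance' point concrete, here is a minimal sketch of a pure power law L(N) = a * N^(-alpha). The exponent and prefactor are illustrative assumptions of mine, not values taken from this post, and the irreducible loss term is ignored.

```python
# Illustrative power-law scaling of loss with parameter count.
# ALPHA and A are hypothetical constants chosen only for illustration.
ALPHA = 0.076  # assumed exponent; small exponents make returns to scale slow
A = 1.0        # assumed prefactor (cancels out in all ratios below)


def loss(n_params: float, alpha: float = ALPHA, a: float = A) -> float:
    """Reducible loss under the assumed law L(N) = a * N**(-alpha)."""
    return a * n_params ** (-alpha)


def loss_ratio(scale_factor: float, alpha: float = ALPHA) -> float:
    """Factor the loss is multiplied by when N is multiplied by scale_factor."""
    return scale_factor ** (-alpha)


def scale_needed(target_ratio: float, alpha: float = ALPHA) -> float:
    """Multiplier on N needed to multiply the loss by target_ratio (< 1)."""
    return target_ratio ** (-1.0 / alpha)


if __name__ == "__main__":
    print(f"2x params  -> loss multiplied by {loss_ratio(2):.3f}")
    print(f"10x params -> loss multiplied by {loss_ratio(10):.3f}")
    print(f"halving the loss needs ~{scale_needed(0.5):.3g}x params")
```

Under this assumed exponent, doubling the parameter count lowers the loss by only about 5%, and halving the loss takes several orders of magnitude more parameters. A different exponent changes the numbers but not the qualitative picture.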
In biology, we also see power law scaling between species. For instance, there is a clear power law scaling curve relating the brain size of various species with roughly how ‘intelligent’ we think they are. Indeed, there are general cross-species scaling laws for intelligence and neuron count and density, with primates being on a superior scaling law to most other animals. These scaling laws are again slow. It takes a very significant increase in neuron count or brain size to really move the needle on observed intelligence. We also see that brain size, unsurprisingly, is a very effective measure of the ‘parameter count’, at least within species which share the same neural density scaling laws.
However, on the inside view, we know there are significant differences in intellectual performance between humans. The differences in performance between tasks are also strongly correlated with each other, such that if someone is bad or good at one task, it is pretty likely that they will also be bad or good at another. If you analyze many such intellectual tasks and perform factor analysis, you tend to get a single dominant factor, which is called the general intelligence factor g. Numerous studies have demonstrated that IQ is a highly reliable measure, is strongly correlated with performance measures such as occupational success, and that a substantial component of IQ is genetic. However, genetic variation of humans on key parameters such as brain size or neuron count, as well as data input, while extant, is very small compared to the logarithmic scaling law factors. Natural human brain size variation does not range over 2x brain volume, let alone a 10x or multiple order of magnitude difference[1]. Under the scaling laws view, this would predict that individual differences in IQ between humans are very small, and essentially logarithmic on the loss.
However, at least from our vantage point, this is not what we observe. Individual differences between humans (and also other animals) appear to very strongly impact performance on an extremely wide range of ‘downstream tasks’. People with IQs at +3 standard deviations, despite their rarity in the population, are responsible for the vast majority of intellectual advancement, while humans of IQ −3 standard deviations are extremely challenged by even simple intellectual tasks. This seems like a very large variation in objective performance which is not predicted by the scaling laws view.
Another thing that is highly confusing is the occurrence of rare cognitive abilities of a specialized type rather than a pure g individual ability. The existence of savantism is the case in point here. Savants can have highly specialized abilities such as a true photographic (eidetic) memory, the ability to perform extremely complex calculations in their heads (human calculators) and many others. These cases show that, despite having a very ANN-like parallel architecture, to some degree humans can end up specialized in more standard serial computer tasks, although this usually (but not always) comes at the cost of general IQ and functioning. It also shows how much variability in performance there is in humans based on what must be relatively small differences in brain architecture, something which is not predicted by a general scaling laws analysis.
Basically, the crux is this: We have good reason to suspect that biological intelligence, and hence human intelligence, roughly follows similar scaling law patterns to what we observe in machine learning systems. At the scale at which the human brain operates, the scaling laws would predict that very large absolute changes in parameter count would be necessary for a significant change in performance. However, when we look at natural human variation in IQ, we see apparently very large changes in intellectual ability without correspondingly large changes in brain size.
How can we resolve this paradox? There are several options, but I am not confident in any of them.
1.) Perhaps there is no contradiction. The scaling laws consider the impact of parameters on the log-loss; however, we can never observe the log-loss directly, only the performance of humans on downstream tasks. It is possible that humans are at a point on the scaling law where the density of tasks that can be solved per unit of log-loss is extremely high, such that even tiny variations in log-loss result in large apparent differences in performance. This would imply that the geometry of the log-loss surface is essentially conformal, such that as we approach the irreducible log-loss of the dataset, the density of tasks that can be solved increases rapidly, so that we asymptotically approach a limit where at the irreducible loss we can solve all tasks. Something like this must be true, insofar as at large losses, decreasing the loss is easy but then becomes increasingly difficult as further decreases in the loss require the learning of ever more subtle patterns. If this is true, then this would mean that the scaling laws would highly underestimate the effects of scale, since the exponential increase in the number of tasks solvable given a unit of log-loss would precisely counteract the logarithmic increase in parameter count, meaning that there would be a linear relationship between the parameter count and the number of tasks that can be solved.
2.) Perhaps our inside view is wrong and human IQ differences are objectively tiny, including on downstream performance tasks. We just observe them as large due to our limited perspective on the full intelligence distribution, and also the selection effects of choosing to measure IQ based on tasks that humans vary considerably over vs the extremely large number of tasks that almost all humans of whatever IQ could solve, or alternatively the tasks that no human could ever solve, no matter their IQ. My feeling is that this is probably some of it, but cannot be the whole story, since the human IQ-performance range seems to cover a wide range of objectively useful tasks like ‘be able to make technological advancements or not’.
3.) Individual differences in IQ are due to different causes than the scaling laws of parameter count and dataset size, and hence have very large effects on log-loss. This is in line with the mutational load view which argues that IQ differences are caused by the number of deleterious mutations and the normal distribution arises from small additive genetic variance caused by small mutations. Under this view, humans would have an ‘ideal scaling law IQ’ significantly higher than any actually observed human IQ corresponding to 0 deleterious mutations, and then our actually observed performance is substantially below this. This would have implications for bio-anchors style AI forecasting since it would mean that the ‘human-level’ estimates are about an intelligence level potentially much greater than actually observed humans which would shorten timelines. The converse would also suggest that improvements to IQ based on, say, iterated embryo selection would be expected to asymptotically vanish as we approach our ‘scaling-law IQ’ so long as we do not also massively increase brain size.
This does leave unanswered the question of what exactly these deleterious mutations actually do. They cannot be variations in brain size, and dataset input is not particularly affectable by the genome. My hunch is that it is something to do with noise levels or connectivity patterns. It could just be something simple, such as that mutations tend to de facto increase the level of noise in neural computation and hence decrease the SNR on each step. This would make long range communication, as well as multi-step sequential thought, vastly more difficult to do correctly, leading to much poorer performance on tasks which require consistent coherent thought over multiple steps, while impairing much less those tasks which can be performed by a standard highly parallel feedforward sweep like core object recognition which, as far as I know, is much less IQ loaded than, say, solving logic puzzles. The SNR view would also predict the common finding that IQ is correlated with reaction time to visual and auditory stimuli. Supposing that the brain uses something like the drift-diffusion model (which is Bayes-optimal for 2AFC tasks) to become sure enough to make a judgement, more noise would require a longer time to integrate neural signals to reach the required level of evidence, and lead to a slowed reaction time. Finally, another possibility here is that genetic differences could affect neuron density in the brain, potentially enabling a much greater scaling of parameter count than would be observable from external brain size [2].
4.) Perhaps individual differences among humans are actually architectural differences which affect scaling law coefficients instead of parameter count. I.e. it could be that people with low IQs are de facto implementing a much worse general purpose brain architecture with worse scaling coefficients in general than those with higher IQs. This could result in almost arbitrary differences in log-loss at the same parameter count. My feeling however is that this seems unlikely, given that architectural differences seem large enough that they would be detectable at the macro-scale, such as in fMRI studies, and I do not know of any results showing this in neuroscience. It also contradicts the additive genetic variance story, and hence the normal distribution of IQ observed, since architectural variation should be expected to be roughly discrete rather than a continuous spectrum. However, perhaps architectural differences are implicitly encoded not in macro-scale structure but in fine-grained connectivity differences which would not be particularly visible in neuroscience. This is consistent with theorizing that conditions such as autism may be due to altered connectivity patterns. There could then be better or worse scaling coefficients for different connectivity patterns, or perhaps the connectivity pattern encodes inductive biases which are better or worse for specific situations.
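The argument in option 1 can be written out as a short derivation. This is only a sketch under assumptions that are mine rather than established results: a pure power law for the loss, and a number of solvable tasks T that grows exponentially per unit decrease in log-loss at some rate k (both T and k are hypothetical symbols introduced here).

```latex
\begin{align}
L(N) &= a\,N^{-\alpha}
  && \text{(assumed power law in parameter count } N\text{)} \\
T(L) &\propto e^{-k \log L} = L^{-k}
  && \text{(assumed: tasks grow exponentially per unit of log-loss)} \\
T(N) &\propto \left(a\,N^{-\alpha}\right)^{-k} \propto N^{\alpha k}
\end{align}
```

On these assumptions the number of solvable tasks is exactly linear in parameter count only when k = 1/α. A faster growth rate gives superlinear returns and a slower one gives sublinear returns, so the size of the effect depends on how densely tasks are packed near the irreducible loss.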
My personal feeling is that the answer is probably some combination of 1 and 3, along with a small amount of selection effects due to 2 making us believe that individual differences are larger than they are ‘objectively’. Architectural differences are also possible, although I remain confused about how these can be implemented by genetic mutations which are small enough to give rise to an additive effect and hence a normal distribution of outcomes. Of course, it could be that the normal distribution itself is some kind of artefact of our testing methodology. Also, architectural changes might more plausibly seem to be the cause of unusual cognitive abilities such as savantism and certain mental disorders such as autism (where autism and savantism almost always cluster together, i.e. almost everyone with savant skills is autistic but not vice versa).
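The reaction-time argument under option 3 can be checked against the standard closed-form results for a drift-diffusion model with symmetric bounds and an unbiased starting point. The sketch below assumes that the decision-maker raises its bound to hold accuracy fixed as noise increases. The function names and parameter values are hypothetical, chosen for illustration.

```python
import math


def ddm_threshold(drift: float, noise_sd: float, target_accuracy: float) -> float:
    """Symmetric bound +/-a that yields target_accuracy for an unbiased DDM."""
    log_odds = math.log(target_accuracy / (1.0 - target_accuracy))
    return noise_sd ** 2 * log_odds / (2.0 * drift)


def ddm_accuracy(drift: float, noise_sd: float, threshold: float) -> float:
    """Probability of hitting the correct bound first."""
    return 1.0 / (1.0 + math.exp(-2.0 * drift * threshold / noise_sd ** 2))


def ddm_mean_decision_time(drift: float, noise_sd: float, threshold: float) -> float:
    """Mean time to reach either bound, excluding non-decision time."""
    return (threshold / drift) * math.tanh(drift * threshold / noise_sd ** 2)


if __name__ == "__main__":
    drift, accuracy = 1.0, 0.95  # illustrative values, not fitted to data
    for noise_sd in (1.0, 1.5, 2.0):
        a = ddm_threshold(drift, noise_sd, accuracy)
        t = ddm_mean_decision_time(drift, noise_sd, a)
        print(f"noise_sd={noise_sd}: bound={a:.2f}, mean decision time={t:.2f}")
```

At fixed accuracy the mean decision time in this model scales with the noise variance, so doubling the noise standard deviation quadruples the time needed. If the bound were held fixed instead, extra noise would mainly reduce accuracy rather than slow responses, so the prediction depends on that assumption.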
[1] Interestingly, studies find a modest but consistent positive correlation between brain size and IQ. This suggests that at least some of the variance in IQ is caused by scaling law like effects, but by no means the majority. This is consistent with the scaling laws, which predict that large changes in brain size would be necessary to obtain significant changes in cognitive performance.
[2] It is argued here that a difference in neuronal density scaling is what differentiates primates from other mammals and is thus why large animals such as elephants and whales are not more intelligent than humans despite their larger brains. Small mutations which affect neuronal density could thus lead to different humans having significantly different neuron counts (and hence scaling law IQs) despite having approximately the same gross brain volume.
I suspect that individual differences in intrinsic motivation / reward function are at least as important as anything you mentioned. In particular, I disagree with your statement “dataset input is not particularly affectable by the genome”. If person A finds it enjoyable and invigorating to be around people, and to pay attention to people, and to think about people, then their lifetime “dataset” will be very people-centric. Conversely, if someone spends their early childhood rotating shapes in their head whenever they have a free moment, they’re going to get really good at it. I have an 8yo kid who loves math and thinks about it all the time. Like one time we were sitting at the dinner table talking about paint colors or whatever, and he interrupted the conversation to tell us something he had figured out about exponentiation. I don’t know what kind of reward function leads to that, but clearly it’s possible.
You also neglected to mention hyperparameters, I think. (Actually, maybe it’s part of your 3.) For example, I imagine that in ML, changing learning rate a bit (for example) can have an outsized effect on final performance. I think there are a lot of things in that general category in the brain. For example, what is the exact curve relating milliseconds-of-delay-versus-synapse-plasticity in STDP? It probably depends on lots of little things in the genome (SNPs in various proteins involved in the process, or whatever). And probably some possible milliseconds-of-delay-versus-synapse-plasticity curves are better than others for learning.
Yes, volume is definitely not the only thing going on with human brains. Human brains are not identical, the way ANNs can be identical save for a knob in a config file increasing the parameter count. (Nor is parameter count the only thing going on with DL scaling, for that matter.) Intelligence is highly polygenic, and the brain volume genetic correlations with intelligence are, while apparently causal, much less than 1 (while intelligence genetically correlates with a lot of other things); the brain imaging studies also show that predicting intelligence taps into a lot more aspects of static neuroanatomy or dynamic patterns than simply brain volume (or some deeper neuron-count). Things like myelination and mitochondrial function will matter, and will support the development processes. General bodily integrity and health and mutation load on all bodily systems will matter. All of these will influence development and the ability to develop connected-but-not-too-connected brain networks which can flexibly coordinate to support fluid intelligence activity. So while you can fiddle the knob and train the same model at different parameter scales and extract the power law, you can’t do that when you compare human brains: it’s as if not only are all the hyperparameters randomized a little each run, the GPUs trained on will convert electricity to FLOPs at wildly different rates, some GPUs just won’t multiply numbers quite right (each one multiplying wrongly in a different way), the occasional layer in a checkpoint might be replaced with some Gaussian noise… (So you can see the influence of volume at the species level because you’re comparing group means where all the noise washes out, but then at individual level it may be much more confusing.)
Do you have links for these studies? Would love to read about what the static and dynamic correlates of g are from brain imaging!
I largely disagree about the intrinsic motivation/reward function points. There is a lot of evidence that there is at least some amount of general intelligence which is independent of interest in particular fields/topics. Of course, if you have a high level of intelligence + interest then your dataset will be heavily oriented towards that topic and you will gain a lot of skill in it, but the underlying aptitude/intelligence can be factored out of this.
How exactly specific interests are encoded is a different and also super fascinating question! It definitely isn’t a pure ‘bit prediction’ intrinsic curiosity since different people seem to care a lot about different kinds of bits. It is at least somewhat affected by external culture / datasets but not entirely (people can often be interested in things against cultural pressure or often before they really know what their interest is). It doesn’t seem super influenced by external reward in a lot of cases. To some extent it ties in with intrinsic aptitude (people tend to be interested in things they are good at) but of course this is at least somewhat circular since people tend to get better at things they are interested in, ceteris paribus.
The hyperparameters are a good point. I was thinking about this largely as architectural changes, but I think that I was wrong about this: they are much more continuous and also potentially much more flexible genetically. This seems to be a better and more likely explanation for continuous IQ distributions than architecture directly. It would definitely be interesting to know how robust the brain is to these kinds of hyperparameter distributions (i.e. over what range do people vary and is it systematic). In ML my understanding is that at large scale models are generally pretty robust to small hyperparameter variations (allowing people to get away with cargo culting hyperparams from other related papers instead of always sweeping themselves), although of course really bad hyperparams destroy performance. The brain may also be less stable due to some combination of recurrent dynamics/active data selection leading to positive or negative loops, as well as just more weird architectural hyperparameters leading to more interactions and ways for things to go wrong.
I think one huge distinction to consider is performance at creating new ideas and capabilities vs performance at things that are already understood.
Human culture contains immense amounts of knowledge, and a major factor for how well you can perform is how much of that knowledge you can absorb (imagine programming without having Google to lookup APIs, or even worse, learning programming from scratch with zero instructions). This is probably a major factor in why human performance varies so much with g.
I don’t think this maps cleanly to the scaling question. On a first look you might think it means AI will inevitably face severe diminishing returns, like the people who are at the forefront of scientific knowledge. However:
AI can have a much broader base of human-generated knowledge than humans, and broad bases of knowledge usually enables fruitful cross-pollination across fields.
AI algorithms can be run massively in parallel to develop new ideas (whereas human population is plateauing and many humans are not capable of contributing to scientific progress).
Those new ideas can likely then be cheaply reintegrated into all the parallel systems (by integrating it into one instance and then copying it, whereas humans need to be individually educated, which is expensive), making it feasible to build further on them.
An AGI that can perform on merely human level is still faster than humans. A 1000x increase in speed (accounting for no need to rest) is sufficient to deliver research from the year 3000 within a year, for anything that’s theoretical or doesn’t require too much compute for simulations and such to develop. That’s FOOM enough, the distinction from an even greater disruption won’t matter in terms of AI risk.
To some extent yes speed can compensate for intelligence but this isn’t really related to the question of FOOM.
In theory, if we have an AGI which is human level but 1000x faster, it might be able to perform at the level of 1000 humans rather than a human from the year 3000. If we have a giant population of AGIs such that we can replicate the entire edifice of human science but running at 1000x faster, then sure. In practice though by Amdahl’s law such a speed increase would just move the bottleneck to something else (probably running experiments/gathering data in the real world) so the speedup would be much less.
I agree with the general point, though, that we don’t need FOOM or RSI for x-risk.
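The Amdahl’s law point above can be made quantitative with the usual formula. The fraction of research work that can actually be accelerated is a hypothetical input here, not something estimated in this thread.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only accelerated_fraction of the work runs factor-x faster.

    Amdahl's law: 1 / ((1 - p) + p / s), with p the accelerated fraction
    and s the speedup applied to that fraction.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


if __name__ == "__main__":
    # Hypothetical fractions of research that is pure thinking (sped up 1000x),
    # with the remainder being real-world experiments that are not sped up.
    for p in (0.5, 0.9, 0.99):
        print(f"{p:.0%} accelerated -> overall speedup {amdahl_speedup(p, 1000):.1f}x")
```

Even if 90% of the work is sped up 1000x, the overall speedup under this formula stays below 10x, because the remaining 10% becomes the bottleneck.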
That’s what I meant, serial speedup of 1000x, and separately from that a sufficient population. Assuming 6 hours a day of intensive work for humans, 5 days a week, there is a 5.6x speedup from not needing to rest. With 3⁄4 words per token, a 1000x speedup given no need to rest requires generation speed of 240 tokens/s. LLMs can do about 20-100 tokens/s when continuing a single prompt. Response latency is already a problem in practice, so it’s likely to improve.
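The arithmetic in this comment can be reproduced as follows. The comment does not state the baseline rate of human thought that it assumes. About one word per second is the value that recovers the quoted 240 tokens/s, so it is used here as an explicit, back-solved assumption.

```python
HOURS_PER_WEEK = 7 * 24            # 168
HUMAN_WORK_HOURS_PER_WEEK = 6 * 5  # 6 hours a day, 5 days a week (from the comment)
WORDS_PER_TOKEN = 0.75             # 3/4 words per token (from the comment)
TARGET_SPEEDUP = 1000.0            # overall speedup relative to a human researcher

# ASSUMPTION: baseline human rate is not given in the comment; ~1 word/s is
# back-solved so that the result lands near the quoted 240 tokens/s.
HUMAN_WORDS_PER_SECOND = 1.0


def rest_speedup() -> float:
    """Speedup gained purely from never needing to rest."""
    return HOURS_PER_WEEK / HUMAN_WORK_HOURS_PER_WEEK


def required_tokens_per_second(
    target_speedup: float = TARGET_SPEEDUP,
    human_words_per_second: float = HUMAN_WORDS_PER_SECOND,
    words_per_token: float = WORDS_PER_TOKEN,
) -> float:
    """Generation speed needed to hit target_speedup once rest is accounted for."""
    serial_speedup = target_speedup / rest_speedup()
    return serial_speedup * human_words_per_second / words_per_token


if __name__ == "__main__":
    print(f"speedup from no rest: {rest_speedup():.1f}x")
    print(f"required generation speed: {required_tokens_per_second():.0f} tokens/s")
```

The required rate scales linearly with the assumed human baseline, so a faster assumed human thinker raises the token rate proportionally.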
An extra 10 points of IQ is worth quite a lot.
I think “loss” is already measuring “intelligence” on a non-linear scale. The power laws aren’t that bad in my opinion. Double the parameters a few more times and we could have another von Neumann. Who lives forever.
Seems like a sweet deal to me.
Something I’m too sleep deprived to think clearly on right now but want your take on.
Which best describes these scaling effects:
Sublinear cumulative returns to cognitive investment from computational resources (model size, training data size, training compute budget)
Superlinearly diminishing marginal returns to cognitive investment from computational resources
While they are related, they aren’t quite the same (e.g. logarithmically growing cumulative returns are unbounded, while exponentially diminishing marginal returns imply bounded cumulative returns, since geometric series with ratios <1 converge). And I haven’t yet played around with the maths enough to have confident takes on what a particular kind of cumulative returns implies about a particular kind of marginal returns and vice versa.
For now, I think I want to distinguish between those two terms (at least until I’ve worked out the maths and understand how they relate).
I tend to think of the scaling laws as sublinear cumulative returns (a la algorithms with superlinear time/space complexity [where returns are measured by the problem sizes the system can solve with a given compute budget]), but you’re way more informed on this than me (I cannot grok the scaling law papers).
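The distinction drawn above between cumulative and marginal returns can be seen numerically in a small sketch. The two marginal-return schedules below are illustrative choices of mine: 1/k marginal returns give roughly logarithmic, unbounded cumulative returns, while geometrically shrinking marginal returns give bounded cumulative returns.

```python
import math
from typing import Callable


def cumulative_returns(marginal: Callable[[int], float], n_steps: int) -> float:
    """Sum the marginal return of each unit of investment from 1 to n_steps."""
    return sum(marginal(k) for k in range(1, n_steps + 1))


def harmonic_marginal(k: int) -> float:
    """Marginal return 1/k: cumulative returns grow like ln(n), without bound."""
    return 1.0 / k


def geometric_marginal(k: int, ratio: float = 0.5) -> float:
    """Marginal return ratio**k: cumulative returns converge to ratio / (1 - ratio)."""
    return ratio ** k


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        h = cumulative_returns(harmonic_marginal, n)
        g = cumulative_returns(geometric_marginal, n)
        print(f"n={n}: 1/k schedule -> {h:.3f} (ln n = {math.log(n):.3f}), "
              f"geometric schedule -> {g:.6f}")
```

Both schedules have diminishing marginal returns, but only the geometric one caps the total, which is why the two descriptions are not interchangeable.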
No, we don’t. Please state the reason(s) explicitly.
I’m basing my thinking here primarily off of Herculano Houzel’s work. If you have reasons you think this is wrong or counterarguments, I would be very interested in them as this is a moderately important part of my general model of AI.
Does parameter count increase logarithmically with unit log loss? Is this a typo, or am I just confused about this?
This phenomenon is not really a paradox if general intelligence is not compact.
If general intelligence is more ensemble like, then most of the massive differences in cognitive abilities on various tasks (chess, theoretical research, etc.) are due to specialisation of neural circuitry for those domains.
That said, the existence of a g factor may weaken the ensemble general intelligence hypothesis. Or perhaps not. Some common metacognitive tasks (learning how to learn, memory, synthesising knowledge, abstraction, etc.) may form a common core that is relatively compact, while domain specific skills (chess, mathematics, literature) are more ensemble like.
This is definitely the case. My prior is relatively strong that intelligence is compact, at least for complex and general tasks and behaviours. Evidence for this comes from ML—the fact that the modern ML paradigm of huge network + lots of data + general optimiser being able to solve a large number of tasks is a fair bit of evidence for this. Other evidence is existence of g and cortical uniformity in general, as well as our flexibility at learning skills like chess, mathematics etc which we clearly do not have any evolutionarily innate specialisation for.
Of course some skills such as motor reflexes and a lot of behaviours are hardwired, but generally we see that as intelligence and generality grow, these decrease in proportion.
What if we learn new domains by rewiring/specialising/developing new neural circuitry for them?
We have a general optimiser that does dedicated cross domain optimisation by developing narrow optimisers?
I think option 2 is the most reasonable explanation. We consider complex things some humans can perform but others can’t. Things that everyone can do are just too simple; while things no one can do are basically impossible. So we have a very limited definition of “complex” that can’t be transferred to different levels of intelligence.
I think the best explanation for it is a combo of 1 and 2. Specifically, I believe that the more intelligent behaviors only emerge in the last few bits of training, and thus scaling laws underestimate how valuable the later bits are. In other words, the long tail bites hard, where the last few bits contain nearly all the intelligence.
More on this from Gwern here:
https://www.gwern.net/Scaling-hypothesis#why-does-pretraining-work
Another explanation from myself on how normally distributed intelligence gives rise to big differences:
Another explanation from myself:
My first thought was #2, that we overestimate the size of the IQ differences because we can only measure on the observed scale. But this doesn’t seem fully satisfactory. I know that connectivity is a very vogue concept and I don’t underestimate its importance, but I have recently been concerned that focusing on connectivity produces a concomitant overlooking of the importance of neuronal-intrinsic factors. One particular area of interest is synaptic cycling. I think about the importance of neuronal density and then consider how much could be gained by subtle additive genetic effects that lead to improved use/reuse of the same synapses. Without altering neuronal density at all, a 10% improvement in how quickly a synapse can form, a synaptic vesicle be repurposed, and a neuron be ready to fire again should effectively be tantamount to a ~10% gain in neuronal density. In other words, the architecture looks the same but performs at a substantially higher throughput.
This is a good idea! I hadn’t thought that much about specific synaptic efficiency metrics. If we think about this in a bit more detail, these would effectively correspond to some kind of changes in hyperparameters for an ML model. I.e. more rapid synaptic changes = potential for higher learning rate effectively. The more rapid synaptic formation (and potentially pruning?) is harder to model in ML but I guess would be an increase in effective parameter count.
Thinking about these as changes in hyperparameters is probably the closest analogy from an ML perspective. I should note that my own area of expertise is genetic epidemiology and neuroscience, not ML, so I am less fluent discussing the computational domain than human-adjacent biological structures. At the risk of speaking outside my depth, I offer the following from the perspective of a geneticist/neuroscientist: My intuition (FWIW) is that all human brains are largely running extremely similar models, and that the large IQ differences observed are either due to 1) inter-individual variability in neuronal performance (the cycling aspect I reference above), or 2) the number of parameters that can be quickly called from storage. The former seems analogous to two machines running the same software but with an underlying difference in hardware (e.g., clock rate), while the latter seems more analogous to two machines running the same software but with vastly different levels of RAM. I can’t decide whether having better functionality at the level of individual neurons is more likely to generate benefit in the “clock rate” or the “RAM” domain. Both seem plausible, and again, my apologies for jettisoning LLM analogies for more historical ones drawn from the PC era. At least I didn’t say some folks were still running vacuum tubes instead of transistors!