Steven Byrnes comments on Scaling laws vs individual differences

Steven Byrnes 10 Jan 2023 14:00 UTC
13 points
1
I suspect that individual differences in intrinsic motivation / reward function are at least as important as anything you mentioned. In particular, I disagree with your statement “dataset input is not particularly affectable by the genome”. If person A finds it enjoyable and invigorating to be around people, and to pay attention to people, and to think about people, then their lifetime “dataset” will be very people-centric. Conversely, if someone spends their early childhood rotating shapes in their head whenever they have a free moment, they’re going to get really good at it. I have an 8yo kid who loves math and thinks about it all the time. Like one time we were sitting at the dinner table talking about paint colors or whatever, and he interrupted the conversation to tell us something he had figured out about exponentiation. I don’t know what kind of reward function leads to that, but clearly it’s possible.
You also neglected to mention hyperparameters, I think. (Actually, maybe it’s part of your 3.) For example, I imagine that in ML, changing the learning rate a bit can have an outsized effect on final performance. I think there are a lot of things in that general category in the brain. For example, what is the exact curve relating milliseconds-of-delay-versus-synapse-plasticity in STDP? It probably depends on lots of little things in the genome (SNPs in various proteins involved in the process, or whatever). And probably some possible milliseconds-of-delay-versus-synapse-plasticity curves are better than others for learning.
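To make the STDP point concrete, here is a minimal sketch, assuming the textbook double-exponential STDP window. The function name, the amplitudes, and the time constants are illustrative placeholders, not measured values and not anything claimed in the thread; the only point is that the delay-versus-plasticity curve is set by a handful of continuous knobs that could plausibly differ a little between individuals.

```python
import math


def stdp_weight_change(delay_ms, a_plus=0.01, a_minus=0.012,
                       tau_plus_ms=20.0, tau_minus_ms=20.0):
    """Weight change for one pre/post spike pair (hypothetical parameters).

    delay_ms is t_post - t_pre: positive means the presynaptic spike came
    first (potentiation), negative means it came second (depression).
    """
    if delay_ms > 0:
        return a_plus * math.exp(-delay_ms / tau_plus_ms)
    if delay_ms < 0:
        return -a_minus * math.exp(delay_ms / tau_minus_ms)
    return 0.0


# Two hypothetical "genomes" that differ only in one time constant.
for tau in (20.0, 40.0):
    change = stdp_weight_change(10.0, tau_plus_ms=tau)
    print(f"tau_plus={tau} ms -> weight change at +10 ms delay: {change:.5f}")
```

With the same spike timing, the wider window produces a larger weight change, so a small shift in one parameter changes what the synapse learns from identical input.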
- gwern 11 Jan 2023 3:34 UTC
  8 points
  1
  
  > I think there are a lot of things in that general category in the brain.
  
  Yes, volume is definitely not the only thing going on with human brains. Human brains are not identical, the way ANNs can be identical save for a knob in a config file increasing the parameter count. (Nor is parameter count the only thing going on with DL scaling, for that matter.) Intelligence is highly polygenic, and the brain volume genetic correlations with intelligence are, while apparently causal, much less than 1 (while intelligence genetically correlates with a lot of other things); the brain imaging studies also show predicting intelligence taps into a lot more aspects of static neuroanatomy or dynamic patterns than simply brain volume (or some deeper neuron-count). Things like myelination and mitochondrial function will matter, and will support the development processes. General bodily integrity and health and mutation load on all bodily systems will matter. All of these will influence development and the ability to develop connected-but-not-too-connected brain networks which can flexibly coordinate to support fluid intelligence activity. So while you can fiddle the knob and train the same model at different parameter scales and extract the power law, you can’t do that when you compare human brains: it’s as if not only are all the hyperparameters randomized a little each run, the GPUs trained on will convert electricity to FLOPs at wildly different rates, some GPUs just won’t multiply numbers quite right (each one multiplying wrongly in a different way), the occasional layer in a checkpoint might be replaced with some Gaussian noise… (So you can see the influence of volume at the species level because you’re comparing group means where all the noise washes out, but then at the individual level it may be much more confusing.)
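
  The group-means point can be illustrated with a toy simulation. This is a sketch under made-up assumptions, not a model of real data: log-ability follows a power law in volume with a hypothetical `slope`, every individual gets large independent noise (`noise_sd`), and the species centers, spreads, and sample sizes are arbitrary.

  ```python
  import math
  import random


  def pearson(xs, ys):
      """Plain Pearson correlation coefficient."""
      n = len(xs)
      mx, my = sum(xs) / n, sum(ys) / n
      cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      vx = sum((x - mx) ** 2 for x in xs)
      vy = sum((y - my) ** 2 for y in ys)
      return cov / math.sqrt(vx * vy)


  def simulate(num_species=8, per_species=500, slope=0.5,
               volume_sd=0.1, noise_sd=0.5, seed=0):
      """Return (between-species r, mean within-species r). All values hypothetical."""
      rng = random.Random(seed)
      mean_volumes, mean_abilities, within = [], [], []
      for s in range(num_species):
          center = 0.5 * s  # assumed species-typical log volume
          log_volume = [rng.gauss(center, volume_sd) for _ in range(per_species)]
          # Power law in log space plus large individual-level noise.
          log_ability = [slope * v + rng.gauss(0.0, noise_sd) for v in log_volume]
          within.append(pearson(log_volume, log_ability))
          mean_volumes.append(sum(log_volume) / per_species)
          mean_abilities.append(sum(log_ability) / per_species)
      return pearson(mean_volumes, mean_abilities), sum(within) / num_species


  between_r, within_r = simulate()
  print(f"between-species r: {between_r:.3f}")
  print(f"mean within-species r: {within_r:.3f}")
  ```

  Averaging over many individuals shrinks the noise in each species mean, so the volume trend is clear across species while staying weak among individuals of the same species.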
  - beren 13 Jan 2023 16:02 UTC
    1 point
    0
    > the brain imaging studies also show predicting intelligence taps into a lot more aspects of static neuroanatomy or dynamic patterns than simply brain volume
    Do you have links for these studies? Would love to have a read about what the static and dynamic correlates of g are from brain imaging!
- beren 10 Jan 2023 14:24 UTC
  7 points
  1
  I largely disagree about the intrinsic motivation/reward function points. There is a lot of evidence that there is at least some amount of general intelligence which is independent of interest in particular fields/topics. Of course, if you have a high level of intelligence + interest then your dataset will be heavily oriented towards that topic and you will gain a lot of skill in it, but the underlying aptitude/intelligence can be factored out of this.
  How exactly specific interests are encoded is a different and also super fascinating question! It definitely isn’t a pure ‘bit prediction’ intrinsic curiosity since different people seem to care a lot about different kinds of bits. It is at least somewhat affected by external culture / datasets but not entirely (people can often be interested in things against cultural pressure or often before they really know what their interest is). It doesn’t seem super influenced by external reward in a lot of cases. To some extent it ties in with intrinsic aptitude (people tend to be interested in things they are good at) but of course this is at least somewhat circular since people tend to get better at things they are interested in, ceteris paribus.
  The hyperparameters point is a good one. I was thinking about this largely as architectural changes, but I think I was wrong about this: hyperparameters are much more continuous and also potentially much more flexible genetically. This seems to be a better and more likely explanation for continuous IQ distributions than architecture directly. It would definitely be interesting to know how robust the brain is to these kinds of hyperparameter distributions (i.e. over what range do people vary, and is it systematic). In ML my understanding is that at large scale, models are generally pretty robust to small hyperparameter variations (allowing people to get away with cargo culting hyperparams from other related papers instead of always sweeping themselves), although of course really bad hyperparams destroy performance. The brain may also be less stable due to some combination of recurrent dynamics/active data selection leading to positive or negative loops, as well as just more weird architectural hyperparameters leading to more interactions and ways for things to go wrong.