Steven Byrnes comments on Source code size vs learned model size in ML and in humans?

Steven Byrnes 27 May 2020 14:47 UTC
4 points
OK, I think that helps.

It sounds like your question is really more like: how many programmer-hours go into putting domain-specific content / capabilities into an AI? (You can disagree.) If it’s very high, then it’s the Robin-Hanson-world where different companies make AI-for-domain-X, AI-for-domain-Y, etc., and they trade and collaborate. If it’s very low, then it’s more plausible that someone will have a good idea and Bam, they have an AGI. (Although it might still require huge amounts of compute.)

If so, I don’t think the information content of the weights of a trained model is relevant. The weights are learned automatically. Changing the code from num_hidden_layers = 10 to num_hidden_layers = 100 is not 10× the programmer effort. (It may or may not require more compute, and it may or may not require more labeled examples, and it may or may not require more hyperparameter tuning, but those are all different things, and in no case is there any reason to think it’s a factor of 10, except maybe some aspects of compute.)
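To make that concrete, here is a minimal sketch, assuming a plain fully connected network. The helper name count_parameters and the layer sizes (784 inputs, width 256, 10 outputs) are made up for illustration and are not from any particular codebase. The point is only that the source-code diff is a single number, while the count of learned weights grows roughly in proportion to depth.

```python
# Hypothetical toy model: a stack of fully connected layers of equal width.
# All sizes below are illustrative assumptions, not taken from a real system.

def count_parameters(input_size, hidden_size, num_hidden_layers, output_size):
    """Count weights and biases in a plain fully connected network."""
    sizes = [input_size] + [hidden_size] * num_hidden_layers + [output_size]
    # Each layer has n_in * n_out weights plus n_out biases.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# The only thing the programmer changes is one integer literal.
small = count_parameters(784, 256, num_hidden_layers=10, output_size=10)
large = count_parameters(784, 256, num_hidden_layers=100, output_size=10)

print("parameters with 10 hidden layers: ", small)
print("parameters with 100 hidden layers:", large)
print("ratio:", large / small)
```

The programmer effort for the change is the same one-line edit either way; what changes is the size of the automatically learned part.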

I don’t think the size of the PyTorch codebase is relevant either.

I agree that the size of the human genome is relevant, as long as we all keep in mind that it’s a massive upper bound, because perhaps a vanishingly small fraction of that is “domain-specific content / capabilities”. Even within the brain, you have to synthesize tons of different proteins, control the concentrations of tons of chemicals, etc. etc.

I think the core of your question is generalizability. If you have AlphaStar but want to control a robot instead, how much extra code do you need to write? Do insights in computer vision help with NLP and vice-versa? That kind of stuff. I think generalizability has been pretty high in AI, although maybe that statement is so vague as to be vacuous. I’m thinking, for example, it’s not like we have “BatchNorm for machine translation” and “BatchNorm for image segmentation” etc. It’s the same BatchNorm.
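As a rough illustration of the BatchNorm point, here is a simplified sketch in plain Python. It is an assumption-laden toy: it leaves out the learned scale and shift and the running statistics that real implementations have, and the two input batches are made-up numbers standing in for "image" and "text" features. The same function is applied to both without any domain-specific code.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize each feature across the batch to zero mean, unit variance.

    Simplified sketch: no learned scale/shift, no running statistics.
    """
    n = len(batch)
    num_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(num_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(num_features)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps)
         for j in range(num_features)]
        for row in batch
    ]

# Made-up stand-ins for features from two different domains.
image_features = [[0.1, 0.9, 0.5], [0.4, 0.2, 0.7], [0.8, 0.6, 0.3]]
token_features = [[12.0, -3.0], [7.0, 5.0], [9.0, 1.0], [4.0, -1.0]]

# Identical code path for both; nothing here knows which domain it is in.
normalized_images = batch_norm(image_features)
normalized_tokens = batch_norm(token_features)
```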

On the brain side, I’m a big believer in the theory that the neocortex has one algorithm which simultaneously does planning, action, classification, prediction, etc. (The merging of action and understanding in particular is explained in my post here, see also Planning By Probabilistic Inference.) So that helps with generalizability. And I already mentioned my post on cortical uniformity. I think a programmer who knows the core neocortical algorithm and wants to then imitate the whole neocortex would mainly need (1) a database of “innate” region-to-region connections, organized by connection type (feedforward, feedback, hormone receptors) and structure (2D array of connections vs 1D, etc.), and (2) a database of region-specific hyperparameters, especially the timing of when each region should lock itself down to prevent further learning (“sensitive periods”). Assuming that’s the right starting point, I don’t have a great sense for how many bits of data this is, but I think the information is out there in the developmental neuroscience literature. My wild guess right now would be on the order of a few KB, but with very low confidence. It’s something I want to look into more when I get a chance. Note also that the would-be AGI engineer can potentially just figure out those few KB from the neuroscience literature, rather than discovering them in a more laborious way.
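Here is a very rough sketch of what those two databases might look like as data structures, plus a naive way to count the bits in the connection table. Everything in it is hypothetical: the class and field names, the region labels "A" to "D", and the lockdown_time field are placeholders I made up, and the toy numbers say nothing about the real brain. It only shows the kind of bookkeeping involved.

```python
from dataclasses import dataclass
from enum import Enum

class ConnectionType(Enum):
    FEEDFORWARD = "feedforward"
    FEEDBACK = "feedback"
    HORMONE_RECEPTOR = "hormone_receptor"

class Structure(Enum):
    ARRAY_1D = "1d"
    ARRAY_2D = "2d"

@dataclass(frozen=True)
class Connection:
    """One 'innate' region-to-region connection (hypothetical schema)."""
    source: str
    target: str
    kind: ConnectionType
    structure: Structure

@dataclass(frozen=True)
class RegionHyperparameters:
    """Region-specific settings (hypothetical schema)."""
    region: str
    lockdown_time: float  # placeholder for when learning is locked down

def bits_to_index(num_options):
    """Bits needed to pick one of num_options alternatives."""
    return (num_options - 1).bit_length()

def naive_connection_bits(num_regions, connections):
    """Naive size of the connection table: no compression, fixed-width fields."""
    per_connection = (
        2 * bits_to_index(num_regions)          # source and target region
        + bits_to_index(len(ConnectionType))    # connection type
        + bits_to_index(len(Structure))         # 1D vs 2D
    )
    return per_connection * len(connections)

# Toy example with placeholder regions; not real anatomy.
toy_connections = [
    Connection("A", "B", ConnectionType.FEEDFORWARD, Structure.ARRAY_2D),
    Connection("B", "A", ConnectionType.FEEDBACK, Structure.ARRAY_2D),
    Connection("C", "D", ConnectionType.FEEDFORWARD, Structure.ARRAY_1D),
]
toy_hyperparameters = [RegionHyperparameters("A", lockdown_time=1.0)]
toy_bits = naive_connection_bits(num_regions=4, connections=toy_connections)
```

An actual estimate would need the real number of regions and connections from the literature, which this sketch does not attempt to supply.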

Oh, you also probably need code for certain non-neocortex functions, like flagging human speech sounds as important to attend to, etc. I suspect that that particular example is about as straightforward as it sounds, but there might be other things that are hard to do, or where it’s not clear what needs to be done. Of course, for an aligned AGI, there could potentially be a lot of work required to sculpt the reward function.

Just thinking out loud :)