I have an objection to the point about how AI models will be more efficient because they don’t need to do massive parallelization:
Massive parallelization is useful for AI models too, and for somewhat similar reasons. Parallel computation lets the model spit out a result more quickly. In the biological setting, this is great because it means you can move out of the way when a tiger jumps toward you. In the ML setting, this is great because it allows the gradient to be computed more quickly. The disadvantage of parallelization is that it requires more hardware. In the biological setting, this means bigger brains. Big brains are costly: they use up a lot of energy, and they make childbearing more difficult since the skull needs to fit through the birth canal.
In the ML setting, however, big brains are not as costly. We don’t need to fit our computers in a skull. So, it is not obvious to me that ML models will do fewer computations in parallel than biological brains.
Some relevant information:
According to Scaling Laws for Neural Language Models, model performance depends strongly on model size but very weakly on shape (depth vs width).
An explanation for the above is that deep residual networks have been observed to behave like ensembles of relatively shallow networks.
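To make the ensemble view concrete, here is a toy sketch (my own illustration, not taken from any cited paper) using linear residual branches, where the expansion is exact: the output of a stack of residual blocks equals the sum over all paths that either skip or pass through each block, and most of those paths are shallow.

```python
import numpy as np
from itertools import product

# Toy illustration of "residual networks behave like ensembles of
# shallow networks", using linear residual branches so the path
# expansion is exact. The sizes and weights here are arbitrary.
rng = np.random.default_rng(0)
d, n_blocks = 4, 3
Ws = [0.1 * rng.normal(size=(d, d)) for _ in range(n_blocks)]
x = rng.normal(size=d)

# Standard residual forward pass: at each block, x -> x + W @ x.
out = x.copy()
for W in Ws:
    out = out + W @ out

# Equivalent "ensemble" view: the output is the sum over all
# 2**n_blocks paths, where each path either skips a block (identity)
# or passes through its residual branch (W). Most paths are short,
# so the stack looks like an ensemble of shallow sub-networks.
ensemble = np.zeros(d)
for mask in product([0, 1], repeat=n_blocks):
    v = x.copy()
    for W, used in zip(Ws, mask):
        if used:
            v = W @ v
    ensemble += v

assert np.allclose(out, ensemble)
```

With nonlinear branches the expansion is no longer exact, but the empirical observation is that the short paths still dominate, which is one way to explain why depth matters less than raw size.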
GPT-3 uses 96 layers (decoder blocks). That isn't very many serial computations. If a matrix multiplication, softmax, ReLU, or vector addition each counts as an atomic computation, then there are 11 serial computations per layer, so that's only 1056 serial computations. It is unclear how to compare this to biological neurons, as each neuron may itself require a number of these serial computations to simulate properly.
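The arithmetic above can be written out as a one-line sketch (the count of 11 atomic serial operations per block is the assumption stated in the text, not an exact accounting of the architecture):

```python
# Back-of-the-envelope serial-depth count for GPT-3, using the numbers
# above: 96 decoder blocks, and 11 atomic serial operations (matmuls,
# softmax, activation, vector additions) assumed per block.
n_layers = 96
ops_per_layer = 11
serial_depth = n_layers * ops_per_layer
print(serial_depth)  # 1056
```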
PaLM has about 3 times more parameters than GPT-3 but only 118 layers.