Here are some things about neural networks that I used to find puzzling but now feel that I have adequate explanations for. The theory behind these answers didn’t start to be understood until well after the correct things to do were found by chance or blind imitation of brains.
Why is good optimization possible?
Neural networks typically deal with “non-convex” optimization problems. Traditionally, using gradient descent for those was considered impractical, because it would rapidly get stuck in local minima. That was part of the motivation for evolutionary approaches.
Why, then, are neural networks trainable by gradient descent? Because if you add enough extra dimensions, non-convex problems become convex. Empirically, with massive overparameterization, the energy landscape tends to have many saddle points but few local minima. Proving convergence guarantees for overparameterized networks is a recent and ongoing research topic; see e.g. this.
As I previously noted, this is why sparse networks from iterative magnitude pruning have good performance, but sparse networks generally can’t be trained from scratch as well as dense networks.
This also explains some “thresholds” of neural network performance vs size: once overparameterization is large enough relative to the non-convexity of the problem, good training becomes possible and performance improves significantly.
Why is generalization possible?
Adding enough free variables can turn non-convex problems into convex ones. Why didn’t people just do that in the past, then? Because long before you get to that point, the extra free variables lead to overfitting that reduces test performance. People tried the same kind of simple regularization that neural networks use, and it was completely inadequate.
Overparameterized neural networks can learn random data. Why do networks with fairly simple regularization tend to generalize?
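To make “can learn random data” concrete, here is a minimal sketch using NumPy. It is not a trained neural network: as a simplifying assumption, the hidden layer is fixed at random and only the output weights are fit by least squares. The function name and all sizes are made up for illustration. The point is only that once there are more hidden features than samples, labels with no signal in them can be fit essentially exactly.

```python
import numpy as np

def fit_random_labels(n_samples=30, n_inputs=5, n_features=300, seed=0):
    """Fit random +/-1 labels with a random-feature model.

    Returns the largest absolute training error after a least-squares fit.
    (Hypothetical helper for illustration; sizes are arbitrary.)
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, n_inputs))    # random inputs
    y = rng.choice([-1.0, 1.0], size=n_samples)   # labels carry no signal

    # Assumption: the hidden layer is random and frozen; only the
    # output weights are fit. This stands in for a trained wide network.
    w = rng.normal(size=(n_inputs, n_features))
    b = rng.normal(size=n_features)
    hidden = np.tanh(x @ w + b)                   # shape (n_samples, n_features)

    # With more features than samples, lstsq returns the minimum-norm
    # solution, which interpolates the labels when the feature matrix
    # has full row rank (random features give this almost surely).
    out_weights, *_ = np.linalg.lstsq(hidden, y, rcond=None)
    return float(np.max(np.abs(hidden @ out_weights - y)))

if __name__ == "__main__":
    print("wide model, max training error:  ", fit_random_labels())
    print("narrow model, max training error:", fit_random_labels(n_features=10))
```

The narrow model (10 features for 30 samples) cannot drive the training error to zero, while the wide one can, even though there is nothing to generalize from.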
Meaningful distance in latent space is basically the main useful property of neural networks. Another phrasing of the above question is: why is distance in latent space meaningful for points that are not in the training set?
A few years back, some people noticed that neural network activation functions have spectral bias. With the types of activation functions used, low-frequency relationships are fit more quickly than high-frequency ones. That causes latent space relationships to be preferentially fit in such a way that point distances are related to point similarity. This can then be tuned by simple regularization: if you have spectral bias and balance learning rate vs regularization globally, you can control the frequency range learned.
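A rough way to see spectral bias for yourself is sketched below, in NumPy. The setup is my own toy choice, not taken from any of the linked papers: a small tanh network is trained with plain full-batch gradient descent on a target that is the sum of a low-frequency and a high-frequency sine of equal amplitude, and afterwards we measure how much of each component is still left in the residual. The function name, network width, initialization scales, learning rate, and step count are all assumptions.

```python
import numpy as np

def spectral_bias_demo(steps=3000, lr=0.01, hidden=64, seed=0):
    """Train a tiny tanh MLP on sin(pi x) + sin(10 pi x) and return the
    amplitude of each component remaining in the residual, as
    (low_frequency, high_frequency). Hypothetical toy setup."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
    low = np.sin(np.pi * x)
    high = np.sin(10 * np.pi * x)
    y = low + high

    # Assumed initialization scales, chosen only to keep training stable.
    w1 = 3.0 * rng.normal(size=(1, hidden))
    b1 = rng.normal(size=hidden)
    w2 = rng.normal(size=(hidden, 1)) / np.sqrt(hidden)
    b2 = np.zeros(1)

    n = len(x)
    for _ in range(steps):
        h = np.tanh(x @ w1 + b1)            # hidden activations, (n, hidden)
        err = h @ w2 + b2 - y               # prediction error, (n, 1)
        dh = (err @ w2.T) * (1.0 - h**2)    # backprop through tanh
        # Gradient step on the mean squared error (up to a constant factor).
        w2 -= lr * (h.T @ err) / n
        b2 -= lr * err.mean(axis=0)
        w1 -= lr * (x.T @ dh) / n
        b1 -= lr * dh.mean(axis=0)

    residual = y - (np.tanh(x @ w1 + b1) @ w2 + b2)

    def amplitude(component):
        # Least-squares coefficient of the residual on this sine component.
        return float(abs((residual * component).sum() / (component**2).sum()))

    return amplitude(low), amplitude(high)

if __name__ == "__main__":
    low_left, high_left = spectral_bias_demo()
    print("low-frequency amplitude left: ", low_left)
    print("high-frequency amplitude left:", high_left)
```

Both components start with a residual amplitude near 1. If the spectral bias story is right, the low-frequency number should fall well below the high-frequency one within the same training budget.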
An obvious way to test this theory of neural network generalization is to find some activation functions with relatively low spectral bias, and see how they perform. This paper tries a “hat” activation function, and finds that loss on the training set goes down much faster but test accuracy is much worse. This paper does some relevant tests on spectral bias.
It’s known that there is no universal best activation function. The optimal choice varies with:
problem type
regularization settings
layer depth
I think using different activation functions for different depths is a semi-common technique at large AI labs now. This spectral bias framework can explain variations in relative performance of activation functions as spectral bias matching.
Why not mixed activation functions?
There are some reasons to think mixing activation functions in the same layer would be better:
Neural networks have many equivalent permutations of their variables. By mixing different activation functions in the same layer, fewer permutations would be equivalent, which increases expressive power.
Using multiple activation functions could allow data to be fit more naturally, so that less capacity is spent inefficiently approximating one kind of function with another.
Brains have mixed activation functions—per synapse, which is like having an activation function per weight.
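The permutation point in the first reason can be checked directly. The sketch below (NumPy, with hypothetical function names and arbitrary sizes) builds a two-layer network in which each hidden position has its own activation function, then reorders the hidden units' weights while leaving the activation attached to each position in place. With a single activation function the reordered network computes the same function; with a mix, it does not.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, w1, b1, w2, activations):
    """Two-layer network where hidden position j uses activations[j]."""
    pre = x @ w1 + b1
    hidden = np.stack(
        [act(pre[:, j]) for j, act in enumerate(activations)], axis=1
    )
    return hidden @ w2

def permutation_changes_output(activations, seed=0):
    """Reverse the order of the hidden units' weights, keep each
    position's activation fixed, and report whether the output changes."""
    rng = np.random.default_rng(seed)
    n_hidden = len(activations)
    x = rng.normal(size=(16, 3))
    w1 = rng.normal(size=(3, n_hidden))
    b1 = rng.normal(size=n_hidden)
    w2 = rng.normal(size=(n_hidden, 1))

    perm = np.arange(n_hidden)[::-1]    # reverse the hidden units
    before = forward(x, w1, b1, w2, activations)
    after = forward(x, w1[:, perm], b1[perm], w2[perm], activations)
    return not np.allclose(before, after)

if __name__ == "__main__":
    uniform = [np.tanh] * 8
    mixed = [np.tanh] * 4 + [relu] * 4
    print("uniform layer, output changed:", permutation_changes_output(uniform))
    print("mixed layer, output changed:  ", permutation_changes_output(mixed))
```

In the mixed case only permutations within each activation group remain symmetries, so fewer weight settings are redundant copies of each other.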
Yet, empirically, mixing activation functions in the same layer tends to give slightly worse performance, and optimizing an activation mix for one situation doesn’t give good results for other situations. Why? My working theory is as follows:
Training involves shifting of data representation between neurons. When neurons have different activation functions, they’re less compatible so such shifting is harder.
Overfitting is based on the lowest spectral bias among the activation functions, so even a fraction of neurons having activation functions with low spectral bias is bad.
The greater expressivity of mixed activation functions then partially cancels out those disadvantages, and the result is performance that’s generally slightly worse.
Interesting post! Do you have papers for the claims on why mixed activation functions perform worse? This is something I have thought about a little but not looked into deeply, so I would appreciate links. My naive thinking is that it mostly doesn’t work because of the difficulty of conditioning, and of keeping the loss landscape smooth and low-curvature, with different activation functions in a layer. With a single activation function, it is relatively straightforward to design an initialization that doesn’t blow up; with mixed ones, it seems the space of potential numerical difficulties increases massively.
No, there are no papers on that topic that I know of. There are relatively few papers that work on mixed activation functions at all. You should understand that papers that don’t show at least a marginal increase on some niche benchmark tend not to get published. So, much of the work on mixed activation functions went unpublished.
But I can link to papers on testing mixed activation functions. Here’s a Bachelor’s thesis from 2022 that did relatively extensive testing. They did evolution of activation function sets for a particular application and got slightly better performance than ReLU/Swish.
That’s an unfair comparison, because adapting an activation function to a particular task can improve performance by itself. The thesis also ran its evolutionary search over single activation functions, and that approach did about as well as the mixed functions.
So far so good, but then, when the network was scaled up from VGG-HE-2 to VGG-HE-4, their evolved activation sets all got worse, while ReLU and Swish got better. Their best mixed activation set went from 80% to 10% accuracy as the network was scaled up, while the evolved single functions held up better but all became worse than Swish.
One of the issues I mentioned with mixed activation functions is specific to SGD training; there’s also been some work on using them with neuroevolution.
Is this generalizable enough to have anything to say on slow vs fast takeoff? For example can you show that you will need a massive net to develop new understandings or more accuracy on existing tasks?
This post is more relevant to that.
Could you explain why this is true?
See this post.