Well, if the node has no variance in its activation at all, then it's constant, and pruning it (folding its constant output into the next layer's bias) will not change the network's behavior at all.
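To make that concrete, here's a minimal numpy sketch; the two-layer ReLU network and all the numbers are made up. Hidden unit 0 is forced to be constant, then pruned by dropping its weights and absorbing its constant output into the next layer's bias, and the outputs don't change.

```python
import numpy as np

# Minimal sketch with a made-up two-layer ReLU network: hidden unit 0 is made
# constant, then pruned by dropping its weights and folding its constant
# output into the next layer's bias.

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def forward(x, W1, b1, W2, b2):
    h = np.maximum(W1 @ x + b1, 0.0)  # ReLU hidden layer
    return W2 @ h + b2

# Make hidden unit 0 constant: zero incoming weights, positive bias.
W1[0, :] = 0.0
b1[0] = 1.5          # its activation is always 1.5, so its variance is zero

# Prune unit 0 and absorb its constant activation into the next layer's bias.
W1_p, b1_p = W1[1:], b1[1:]
W2_p, b2_p = W2[:, 1:], b2 + W2[:, 0] * 1.5

x = rng.normal(size=3)
print(np.allclose(forward(x, W1, b1, W2, b2),
                  forward(x, W1_p, b1_p, W2_p, b2_p)))   # True
```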
I can prove an upper bound. Pruning a node whose activation has standard deviation X should increase the loss by at most KX, where K is the product of the operator norms of the weight matrices. The basic idea is that the network is a Lipschitz function with Lipschitz constant K, so pruning the node, which amounts to injecting noise of standard deviation X at that node, perturbs the logit by at most KX (in standard deviation). And as the logit is log(p) − log(1−p), and an increase in log(p) means a decrease in log(1−p), each of those terms must be perturbed by at most KX; since the loss is −log(p) (or −log(1−p), depending on the label), the loss itself changes by at most KX.
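A rough sketch of how that constant could be computed, following the argument above; the weight matrices here are placeholders, and the activations are assumed to be 1-Lipschitz (e.g. ReLU).

```python
import numpy as np

# Hedged sketch of the bound: for 1-Lipschitz activations, take K to be the
# product of the operator (spectral) norms of the weight matrices, as in the
# argument above.  A perturbation of standard deviation X at a hidden unit
# then moves the logit by at most ~K * X.  All matrices below are made up.

rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 8)),   # layer 1
           rng.normal(size=(8, 16)),   # layer 2
           rng.normal(size=(1, 8))]    # output layer (single logit)

K = np.prod([np.linalg.norm(W, ord=2) for W in weights])  # product of spectral norms
X = 0.01                                                   # std of the pruned unit's activation
print(f"pruning a unit with activation std {X} shifts the logit by at most ~{K * X:.3f}")
```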
What this means in practice is that, if the kernels are smallish, the neurons with small standard deviation in activation aren't affecting the output much. Of course, it's possible for a neuron to have a large standard deviation and have its output totally ignored by the next layer. It's possible for a neuron to have a large standard deviation and be actively bad. And it's possible for a tiny standard deviation to be amplified by large values in the kernels, as in the toy example below.
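Here is a toy illustration of those caveats (all numbers invented): the next layer's weights decide how much of a unit's variation actually reaches the output, so activation standard deviation alone can point the wrong way.

```python
import numpy as np

# Toy illustration: unit 0 has a big std but zero outgoing weight (safe to
# prune); unit 2 has a tiny std but a huge outgoing weight (risky to prune).

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 3)) * np.array([5.0, 5.0, 0.01])  # per-unit activations
W_next = np.array([[0.0, 1.0, 400.0]])                          # outgoing weights

std = acts.std(axis=0)
contrib = np.abs(W_next).ravel() * std   # rough size of each unit's downstream effect

print("activation std:    ", np.round(std, 3))
print("downstream effect: ", np.round(contrib, 3))
```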