“…nodes with the smallest standard deviation.” Does this mean nodes whose weights have the lowest absolute values?
Not quite. It means running the network on the training data, looking at each node's activation values (which will always be ≥0, as the activation function is ReLU), and taking their empirical standard deviation. So consider the randomness to be a random choice of input datapoint.
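To make that concrete, here is a rough NumPy sketch of the measurement. The single dense layer, the random data standing in for the training set, and the names `activation_std`, `W` and `b` are all made up for illustration; this is not the actual network from the post.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def activation_std(X, W, b):
    """Per-node empirical std of ReLU activations over the datapoints in X."""
    acts = relu(X @ W + b)       # shape: (n_datapoints, n_nodes)
    return acts.std(axis=0)      # randomness = choice of datapoint (row)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))   # stand-in for the training data
W = rng.normal(size=(8, 16))     # hypothetical layer weights
b = rng.normal(size=16)

# Make node 3 constant on purpose: zero incoming weights, positive bias.
W[:, 3] = 0.0
b[3] = 0.5

stds = activation_std(X, W, b)
prune_order = np.argsort(stds)   # candidate pruning order, smallest std first
```

The nodes at the front of `prune_order` are the ones being proposed for pruning; node 3 has standard deviation exactly zero here because its activation never changes.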
Ah okay. Are there theoretical reasons to think that neurons with lower variance in activation would be better candidates for pruning? I guess it would be that the effect of those nodes is similar across different datapoints, so they can be pruned and their effects will be replicated by the rest of the network.
Well, if the node has no variance in its activation at all, then it's constant, and pruning it will not change the network's behavior at all (assuming the constant value is kept, e.g. by folding it into the next layer's bias).
I can prove an upper bound. Pruning a node with standard deviation X should increase the loss by at most KX, where K is the product of the operator norms of the weight matrices. The basic idea is that the network is a Lipschitz function, with Lipschitz constant K (ReLU is 1-Lipschitz, so only the weight matrices contribute). So the noise introduced by pruning means randomness of standard deviation at most KX in the logit prediction. And as the logit is log(p)−log(1−p), and an increase in log(p) means a decrease in log(1−p), each of those must be perturbed by at most KX.
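A quick numerical sanity check of the logit part of that bound, as a sketch only. I'm assuming "pruning" means replacing the node's activation with its mean, and I'm taking K over the layers after the pruned node, which is my reading of the argument. The downstream network (one ReLU layer, then a scalar logit) and all its weights are made up.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def downstream_logit(h, W2, b2, w3, b3):
    # Layers after the pruned node: one hidden ReLU layer, then a scalar logit.
    return relu(h @ W2 + b2) @ w3 + b3

rng = np.random.default_rng(0)
H = relu(rng.normal(size=(500, 16)))      # activations of the layer containing the node
W2 = rng.normal(size=(16, 16)) * 0.3      # hypothetical downstream weights
b2 = rng.normal(size=16) * 0.1
w3 = rng.normal(size=16) * 0.3
b3 = 0.0

# K: product of operator (spectral) norms of the downstream weight matrices.
# For the 1-D vector w3 the operator norm is just its Euclidean norm.
K = np.linalg.norm(W2, 2) * np.linalg.norm(w3)

node = 5
X_std = H[:, node].std()                  # the "X" in the bound

# Prune the node by replacing its activation with its mean over the data.
H_pruned = H.copy()
H_pruned[:, node] = H[:, node].mean()

delta = (downstream_logit(H_pruned, W2, b2, w3, b3)
         - downstream_logit(H, W2, b2, w3, b3))
rms_change = np.sqrt(np.mean(delta ** 2)) # RMS perturbation of the logit
```

Per datapoint, `abs(delta)` is at most K times the node's deviation from its mean, so `rms_change` can never exceed `K * X_std`. In practice the bound is usually very loose, since the product of spectral norms is a worst case.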
What this means in practice is that, if the kernels are smallish, then the neurons with small standard deviation in activation aren’t affecting the output much. Of course, it’s possible for a neuron to have a large standard deviation and have its output totally ignored by the next layer. It’s possible for a neuron to have a large standard deviation and be actively bad. It’s possible for a tiny standard deviation to be amplified by large values in the kernels.
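A toy illustration of two of those caveats, with entirely made-up activations and outgoing weights: a noisy neuron the next layer ignores, and a nearly constant neuron the next layer amplifies.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two hypothetical neurons, observed over n datapoints.
a_big = np.maximum(5.0 * rng.normal(size=n), 0.0)  # large std
a_small = 1.0 + 0.01 * rng.normal(size=n)          # tiny std

# Made-up outgoing weights into a single unit of the next layer.
w_big = 0.0      # next layer ignores the noisy neuron
w_small = 500.0  # next layer amplifies the quiet one

def output_change_std(a, w):
    # Std of the change in the next layer's pre-activation
    # when the neuron is replaced by its mean activation.
    return np.std(w * (a - a.mean()))

ignored = output_change_std(a_big, w_big)
amplified = output_change_std(a_small, w_small)
```

Here `ignored` is exactly zero even though `a_big` varies a lot, while `amplified` is hundreds of times larger than the standard deviation of `a_small`. So ranking by activation standard deviation alone would get both of these neurons wrong.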