aogara comments on Train first VS prune first in neural networks.

aogara 9 Jul 2022 16:58 UTC
1 point
“…nodes with the smallest standard deviation.” Does this mean nodes whose weights have the lowest absolute values?
- Donald Hobson 9 Jul 2022 18:04 UTC
  LW: 3 AF: 1
  Not quite. It means running the network on the training data, looking at the values each node takes (which will always be $\geq 0$, as the activation function is ReLU), and taking the empirical standard deviation for each node. So consider the randomness to be a random choice of input datapoint.
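  A minimal sketch of that procedure, assuming a single dense ReLU layer and random stand-in data. The names `activation_std`, `X`, `W` and `b` are illustrative and not from the post, and node 0 is deliberately made "dead" so that its standard deviation is exactly zero.

  ```python
  import numpy as np

  def relu(z):
      return np.maximum(z, 0.0)

  def activation_std(X, W, b):
      """Empirical std of each node's ReLU activation, taken over datapoints.

      X: (n_datapoints, n_inputs), W: (n_inputs, n_nodes), b: (n_nodes,)
      """
      acts = relu(X @ W + b)      # (n_datapoints, n_nodes), all values >= 0
      return acts.std(axis=0)     # one standard deviation per node

  rng = np.random.default_rng(0)
  X = rng.normal(size=(1000, 8))  # stand-in for the training data (assumption)
  W = rng.normal(size=(8, 16))
  b = rng.normal(size=16)
  b[0] = -1e3                     # illustration only: node 0 never activates

  stds = activation_std(X, W, b)
  prune_order = np.argsort(stds)  # candidate pruning order, smallest std first
  ```

  Here the randomness is only the choice of row in `X`, so `stds` has one entry per node and the dead node comes first in `prune_order`.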
  - aogara 9 Jul 2022 20:58 UTC
    1 point
    Ah okay. Are there theoretical reasons to think that neurons with lower variance in activation would be better candidates for pruning? I guess it would be that the effect of those nodes is similar across different datapoints, so they can be pruned and their effects will be replicated by the rest of the network.
    - Donald Hobson 9 Jul 2022 23:46 UTC
      LW: 5 AF: 1
      Well, if the node has no variance in its activation at all, then it's constant, and pruning it will not change the network's behavior at all (its constant output can be folded into the next layer's bias).
      I can prove an upper bound. Pruning a node with standard deviation X should increase the loss by at most KX, where K is the product of the operator norms of the weight matrices. The basic idea is that the network is a Lipschitz function, with Lipschitz constant K. So adding the random noise means randomness of standard deviation at most KX in the logit prediction. And as the logit is $\log(p) - \log(1 - p)$, and an increase in $\log(p)$ means a decrease in $\log(1 - p)$, each of those must be perturbed by at most KX.
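      A rough numerical check of the logit-level step of this argument, as a sketch under stated assumptions: a small random ReLU network, "pruning" modelled as replacing the node's activation with its mean over the data, and K taken as the product of the spectral norms of the weight matrices downstream of the pruned node. All names and shapes here are illustrative.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      def relu(z):
          return np.maximum(z, 0.0)

      X = rng.normal(size=(500, 6))   # stand-in data (assumption)
      W1, b1 = rng.normal(size=(6, 10)), rng.normal(size=10)
      W2, b2 = rng.normal(size=(10, 10)), rng.normal(size=10)
      W3, b3 = rng.normal(size=(10, 1)), rng.normal(size=1)

      def downstream(h):
          """Map first-hidden-layer activations to the output logit."""
          return relu(h @ W2 + b2) @ W3 + b3

      h = relu(X @ W1 + b1)
      node = int(np.argmin(h.std(axis=0)))   # node with the smallest std
      x_std = h[:, node].std()               # the "X" in the bound

      # Assumed pruning model: replace the node's activation by its mean.
      h_pruned = h.copy()
      h_pruned[:, node] = h[:, node].mean()

      # ReLU is 1-Lipschitz, so the downstream map is Lipschitz with constant
      # at most the product of the spectral norms of W2 and W3.
      K = np.linalg.norm(W2, 2) * np.linalg.norm(W3, 2)

      rms_change = np.sqrt(np.mean((downstream(h) - downstream(h_pruned)) ** 2))
      bound_holds = rms_change <= K * x_std + 1e-9
      ```

      In this setup `rms_change` is the root-mean-square change in the logit across datapoints, and it cannot exceed `K * x_std`. The further step from logit to loss is the one argued in the text and is not checked here.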
      What this means in practice is that, if the kernels are smallish, then the neurons with small standard deviation in activation aren't affecting the output much. Of course, it's possible for a neuron to have a large standard deviation and have its output totally ignored by the next layer. It's possible for a neuron to have a large standard deviation and be actively bad. It's possible for a tiny standard deviation to be amplified by large values in the kernels.
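      Two of these caveats can be seen in a toy example. This is a hypothetical two-node layer with hand-picked outgoing weights (`w_out` is made up for illustration), where node 0 has a large standard deviation but is ignored by the next layer, and node 1 has a tiny standard deviation that is heavily amplified. Pruning is again modelled as replacing the activation with its mean.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      n = 1000

      # Non-negative activations, as they would be after a ReLU.
      h = np.column_stack([
          np.abs(rng.normal(0.0, 5.0, n)),    # node 0: large std
          np.abs(rng.normal(0.0, 0.01, n)),   # node 1: tiny std
      ])

      # Hypothetical next-layer weights: node 0 ignored, node 1 amplified.
      w_out = np.array([0.0, 1000.0])

      def output_change_if_pruned(node):
          """RMS change in the next layer's pre-activation after pruning."""
          h_pruned = h.copy()
          h_pruned[:, node] = h[:, node].mean()
          return np.sqrt(np.mean((h @ w_out - h_pruned @ w_out) ** 2))

      change0 = output_change_if_pruned(0)    # large-std node, ignored downstream
      change1 = output_change_if_pruned(1)    # small-std node, amplified downstream
      ```

      Ranking by standard deviation alone would prune node 1 first, even though removing node 0 is the change that leaves the output untouched.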