RogerDearnaley comments on Sparse MLP Distillation

RogerDearnaley 15 Jan 2024 22:24 UTC
4 points
> so that each neuron in the original MLP is simulated by two or three very similar neurons.
If you were using any form of weight decay, this is to be expected.
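
To illustrate one way this could come about, here is a minimal NumPy sketch. It assumes a toy one-hidden-layer ReLU MLP without biases; the helpers `mlp` and `split_neuron` are hypothetical names for this illustration and are not taken from the post. Replacing one hidden neuron with k identical copies, each carrying 1/k of the original output weights, leaves the computed function unchanged while dividing the L2 penalty on those output weights by k.

```python
import numpy as np

def mlp(x, W_in, W_out):
    """Toy one-hidden-layer ReLU MLP without biases (an assumption of this sketch)."""
    return np.maximum(x @ W_in, 0.0) @ W_out

def split_neuron(W_in, W_out, idx, k):
    """Replace hidden neuron `idx` with k identical copies.

    Each copy keeps the original input weights and gets 1/k of the
    original output weights, so the summed contribution is unchanged.
    The copies are appended as the last k hidden neurons.
    """
    w_in = W_in[:, [idx]]    # shape (d_in, 1)
    w_out = W_out[[idx], :]  # shape (1, d_out)
    W_in_new = np.concatenate(
        [np.delete(W_in, idx, axis=1), np.repeat(w_in, k, axis=1)], axis=1
    )
    W_out_new = np.concatenate(
        [np.delete(W_out, idx, axis=0), np.repeat(w_out / k, k, axis=0)], axis=0
    )
    return W_in_new, W_out_new

rng = np.random.default_rng(0)
W_in = rng.normal(size=(4, 3))   # 4 inputs, 3 hidden neurons
W_out = rng.normal(size=(3, 2))  # 3 hidden neurons, 2 outputs
x = rng.normal(size=(5, 4))      # batch of 5 inputs

k = 3
W_in2, W_out2 = split_neuron(W_in, W_out, idx=0, k=k)

# The network computes the same function before and after the split.
same_function = np.allclose(mlp(x, W_in, W_out), mlp(x, W_in2, W_out2))

# L2 penalty on the output weights of the split neuron, before and after.
out_penalty_before = np.sum(W_out[0] ** 2)
out_penalty_after = np.sum(W_out2[-k:] ** 2)

# L2 penalty on the (duplicated) input weights, for comparison.
in_penalty_before = np.sum(W_in[:, 0] ** 2)
in_penalty_after = np.sum(W_in2[:, -k:] ** 2)
```

Note that in this sketch the input weights are duplicated, so their penalty grows by a factor of k. Whether the total penalty falls therefore depends on the relative norms of the input and output weights and on which parameters the decay is actually applied to. The copies here are exactly identical, whereas trained neurons would only be very similar.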