Nice article. It was easy to read and the explanations were mostly clear. I’ve finally learned what the LTH is, yay. I have a few questions.
What is a subnetwork of a convolutional neural network? Is it the function obtained by zeroing some non-bias parameters? Does the same also hold for ResNets, i.e. do all skip connections remain?
Can you add a qualification in the very beginning saying that all this applies to multilayer perceptrons and CNNs in supervised learning rather than all NNs?
You often use the word dense, and sometimes I can’t tell whether you mean a fully connected layer or an unpruned (fully connected or convolutional) layer. Can you clarify it at the occurrences where it might be unclear?
You write: "In some follow-up work, Frankle and Carbin also found it necessary to use “late resetting”, which means resetting the weights of the subnetwork not to their original values from initialization but to their values from 1%-7% of the way through training the dense network."
Do you mean the weights that the network had after 1%-7% of all iterations? Or do you mean moving 1%-7% of the way along the line segment from the initial weights to the weights at the point where training stopped?
In the case of convolutional neural networks, the subnetwork can vary depending on how you choose the pruning granularity. In the LTH paper, they apply unstructured pruning, i.e. they replace individual weights in the convolution filters with 0. But you could imagine applying structured pruning instead, replacing vectors, kernels, or even complete filters with zeroes. The architecture of the subnetwork is thus not necessarily different, but the information that flows through the network will be, since the network is now sparse. So you generally don’t want to change the overall architecture of the network, keeping skip connections intact for example.
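To make the granularity difference concrete, here is a minimal sketch of my own (not code from the paper) pruning a single conv layer in PyTorch, once unstructured (zeroing individual low-magnitude weights, as in the LTH) and once structured (zeroing whole filters). The layer size, sparsity levels, and variable names are arbitrary illustrations.

```python
import torch
import torch.nn as nn

# Hypothetical toy layer; sizes and sparsity levels are arbitrary.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

with torch.no_grad():
    w = conv.weight  # shape: (out_channels=8, in_channels=3, 3, 3)

    # Unstructured mask (the LTH setting): 1 for the individual weights with
    # the largest magnitudes, 0 for roughly the bottom 50%.
    threshold = w.abs().flatten().kthvalue(w.numel() // 2).values
    unstructured_mask = (w.abs() > threshold).float()

    # Structured mask, for contrast: keep only the 4 whole filters with the
    # largest L1 norm and zero the others entirely.
    filter_norms = w.abs().sum(dim=(1, 2, 3))  # one norm per output filter
    keep = filter_norms.argsort(descending=True)[:4]
    structured_mask = torch.zeros_like(w)
    structured_mask[keep] = 1.0

    # Either way, "pruning" just multiplies the weights by a binary mask;
    # the layer itself (and the rest of the architecture) stays the same.
    w.mul_(unstructured_mask)
```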
Although the LTH was empirically discovered on MLPs and CNNs in supervised learning, I see more and more occurrences of the LTH in other training paradigms, e.g. this one in the RL context.
AFAICT, when the word “dense” was used here, it was always in opposition to “sparse”.
Concerning the “late resetting”, your first intuition was correct: instead of resetting the weights to their values at iteration 0 (the initialization), they reset them to their values at a later iteration (after 1%-7% of the total number of iterations). They’ve actually written another paper studying what happens in early training and why “late resetting” might make sense.
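If it helps, here is a rough sketch of how late resetting slots into the pruning procedure; `train`, `compute_mask`, the sparsity level, and the iteration counts are placeholders of mine, not the authors’ code.

```python
import copy
import torch

def lottery_ticket_with_late_resetting(model, loader, total_iters, rewind_iter):
    # Train briefly (1%-7% of total_iters) and snapshot the weights:
    # this snapshot, not the initialization, is what we later reset to.
    train(model, loader, num_iters=rewind_iter)                 # assumed helper
    rewind_state = copy.deepcopy(model.state_dict())

    # Finish training the dense network.
    train(model, loader, num_iters=total_iters - rewind_iter)

    # Prune, e.g. keep the 20% largest-magnitude weights.
    # Assumed helper returning {parameter name: binary mask tensor}.
    mask = compute_mask(model, sparsity=0.8)

    # Late resetting: rewind to the weights from `rewind_iter`, apply the mask,
    # and retrain the sparse subnetwork.
    model.load_state_dict(rewind_state)
    with torch.no_grad():
        for name, param in model.named_parameters():
            param.mul_(mask[name])
    train(model, loader, num_iters=total_iters)
    return model
```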
Hope that helps!