The way I think of the Lottery Ticket Hypothesis is in terms of the procedure for actually finding a lottery ticket. Start with a randomly initialised network and train it. Then look at which weights have the largest magnitude after training (say, keep only the top 10% or 1%). Go back to the initial random network and drop all the weights that won’t end up among the largest. Now train that highly sparse network, and you’ll end up with close to the same final performance, even though you’re using a much smaller network.
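For anyone who wants to see that procedure concretely, here is a minimal NumPy sketch. Everything specific in it is my own assumption for illustration, not anything taken from the paper: the tiny tanh MLP, the toy regression data, the hyperparameters, pruning each layer separately in a single shot, and the helper names (`init_weights`, `train`, `magnitude_masks`, `find_ticket`).

```python
import numpy as np

def init_weights(rng, n_in=20, n_hidden=64):
    # Hypothetical tiny MLP: one tanh hidden layer, scalar output.
    return {
        "W1": rng.normal(0, 1 / np.sqrt(n_in), size=(n_in, n_hidden)),
        "W2": rng.normal(0, 1 / np.sqrt(n_hidden), size=(n_hidden, 1)),
    }

def forward(w, X):
    h = np.tanh(X @ w["W1"])
    return h, h @ w["W2"]

def mse(w, X, y):
    return float(np.mean((forward(w, X)[1] - y) ** 2))

def train(w0, masks, X, y, steps=500, lr=0.05):
    # Full-batch gradient descent on 0.5 * mean squared error.
    # Masked-out weights start at zero and never receive an update.
    w = {k: w0[k] * masks[k] for k in w0}  # new arrays, so w0 is left untouched
    n = len(X)
    for _ in range(steps):
        h, pred = forward(w, X)
        err = (pred - y) / n                      # d(loss)/d(pred)
        grad_W2 = h.T @ err
        grad_h = err @ w["W2"].T
        grad_W1 = X.T @ (grad_h * (1 - h ** 2))   # backprop through tanh
        w["W1"] -= lr * grad_W1 * masks["W1"]
        w["W2"] -= lr * grad_W2 * masks["W2"]
    return w

def magnitude_masks(w, keep_fraction):
    # Keep the largest-magnitude weights in each layer, drop the rest.
    masks = {}
    for k, v in w.items():
        n_keep = max(1, int(round(keep_fraction * v.size)))
        threshold = np.sort(np.abs(v).ravel())[-n_keep]
        masks[k] = (np.abs(v) >= threshold).astype(v.dtype)
    return masks

def find_ticket(seed=0, keep_fraction=0.1):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(256, 20))
    y = np.sin(X[:, :1]) + 0.5 * X[:, 1:2]        # toy regression target
    w_init = init_weights(rng)
    dense = {k: np.ones_like(v) for k, v in w_init.items()}
    w_dense = train(w_init, dense, X, y)              # 1. train the full network
    masks = magnitude_masks(w_dense, keep_fraction)   # 2. find the largest trained weights
    w_ticket = train(w_init, masks, X, y)             # 3. rewind to the initial values, retrain sparse
    return masks, w_ticket, mse(w_dense, X, y), mse(w_ticket, X, y)

if __name__ == "__main__":
    masks, w_ticket, dense_loss, ticket_loss = find_ticket()
    print("dense loss:", dense_loss, "ticket loss:", ticket_loss)
```

The key detail is step 3: the sparse network is retrained from `w_init`, the same initial values the dense network started from, rather than from a fresh random draw. How close the two losses end up on a toy problem like this will depend on the seed and on `keep_fraction`, so treat the numbers as a way to poke at the idea, not as evidence for it.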
This seems to mean that all those weights we dropped don’t really have any effect on training. We might have imagined that weird nonlinear effects would make a good final solution depend on weights that end up useless: those weights might have been “catalysts”, helping the network arrive at its final solution without themselves being useful for prediction. But no, it seems that if weights are going to end up useless, then they’re useless from the beginning.
Right, that’s basically the picture/experiment from the original paper.
Oh, I wasn’t claiming originality, just trying to give some background to people who might have stumbled here.