Part of why I think the two tickets are the same is that the at-initialization ticket is found by taking the after-training ticket and rewinding it to the beginning!
This is true in the original LTH paper, but there the “at-initialization ticket” doesn’t actually perform well: it’s just easy to train to high performance.
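For concreteness, the rewind procedure being discussed can be sketched roughly as follows (a toy sketch of iterative magnitude pruning with rewind-to-init; `find_ticket`, `fake_train`, and all parameter values are my own illustrative names, not from either paper):

```python
import numpy as np

def find_ticket(init_weights, train, prune_frac=0.2, rounds=3):
    """Sketch of the original-LTH recipe: repeatedly train, prune the
    smallest surviving weights, and finally 'rewind' by applying the
    pruning mask to the ORIGINAL initialization."""
    mask = {k: np.ones_like(w) for k, w in init_weights.items()}
    for _ in range(rounds):
        trained = train(init_weights, mask)  # retrain the masked net from init
        for k, w in trained.items():
            alive = np.abs(w[mask[k] == 1])  # magnitudes of surviving weights
            thresh = np.quantile(alive, prune_frac)
            mask[k] = np.where(np.abs(w) <= thresh, 0.0, mask[k])
    # the "at-initialization ticket": original init weights under the final mask
    rewound = {k: init_weights[k] * mask[k] for k in init_weights}
    return mask, rewound

# Toy usage with a fake "train" that just perturbs the masked weights.
rng = np.random.default_rng(0)
init = {"layer": rng.normal(size=(4, 4))}
fake_train = lambda ws, m: {k: (w + 0.1 * rng.normal(size=w.shape)) * m[k]
                            for k, w in ws.items()}
mask, ticket = find_ticket(init, fake_train)
```

The point of the sketch is the last step: the ticket’s surviving weights are copied from the original initialization, which is why the after-training ticket and the at-initialization ticket share weights by construction.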
In the multi-prize LTH paper, the “at-initialization ticket” does perform well, but they don’t find it by winding back the weights of a trained pruned network.
If you got multi-prize at-initialization tickets by winding back the weights of a trained pruned network, I would find that pretty convincing—the idea that they’d be totally different networks would seem like too much of a coincidence. But I would still want to actually check whether the weights were the same (which, funnily enough, isn’t trivial if you’re not familiar with a little-discussed symmetry of DNNs: for a hidden-layer neuron with a ReLU activation function, you can scale its input weights up by a positive constant and its output weights down by the same constant without changing the function the network computes).
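That rescaling symmetry is easy to verify numerically. Here is a minimal sketch with a one-hidden-layer ReLU network (all names and sizes are illustrative; note the neuron’s bias must be rescaled along with its input weights):

```python
import numpy as np

# Toy network: f(x) = W2 @ relu(W1 @ x + b1) + b2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

def forward(W1, b1, W2, b2, x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# Rescale hidden neuron 0: multiply its incoming weights (and bias) by
# c > 0, divide its outgoing weights by c. ReLU is positively
# homogeneous -- relu(c*z) = c*relu(z) for c > 0 -- so the factor of c
# cancels in the next layer and the network's function is unchanged.
c = 3.7
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[0] *= c
b1s[0] *= c
W2s[:, 0] /= c

x = rng.normal(size=3)
assert np.allclose(forward(W1, b1, W2, b2, x),
                   forward(W1s, b1s, W2s, b2, x))
```

So two networks with wildly different-looking weight matrices can compute exactly the same function, which is why a naive elementwise comparison of weights isn’t a sufficient check.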
OH this indeed changes everything (about what I had been thinking) thank you! I shall have to puzzle over these ideas some more then, and probably read the multi-prize paper more closely (I only skimmed it earlier)
Ah to be clear I am entirely basing my comments off of reading the abstracts (and skimming the multi-prize paper with an eye one develops after having been an ML PhD student for mumbles indistinctly years).