When the network is randomly initialized, there is a sub-network that is already decent at the task.
From what I can tell, the paper doesn’t demonstrate this—i.e. I don’t think they ever test the performance of a sub-network with random weights (rather they test the performance of a subnetwork after training only the subnetwork). Though maybe this isn’t what you meant, in which case you can ignore me :)
The original paper doesn’t demonstrate this but later papers do, or at least claim to. Here are several papers with quotes:
https://arxiv.org/abs/2103.09377 “In this paper, we propose (and prove) a stronger Multi-Prize Lottery Ticket Hypothesis: A sufficiently over-parameterized neural network with random weights contains several subnetworks (winning tickets) that (a) have comparable accuracy to a dense target network with learned weights (prize 1), (b) do not require any further training to achieve prize 1 (prize 2), and (c) is robust to extreme forms of quantization (i.e., binary weights and/or activation) (prize 3).”
https://arxiv.org/abs/2006.12156 ″An even stronger conjecture has been proven recently: Every sufficiently overparameterized network contains a subnetwork that, at random initialization, but without training, achieves comparable accuracy to the trained large network.”
https://arxiv.org/abs/2006.07990 The strong {\it lottery ticket hypothesis} (LTH) postulates that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. A recent work by Malach et al. \cite{MalachEtAl20} establishes the first theoretical analysis for the strong LTH: one can provably approximate a neural network of width d and depth l, by pruning a random one that is a factor O(d4l2) wider and twice as deep. This polynomial over-parameterization requirement is at odds with recent experimental research that achieves good approximation with networks that are a small factor wider than the target. In this work, we close the gap and offer an exponential improvement to the over-parameterization requirement for the existence of lottery tickets. We show that any target network of width d and depth l can be approximated by pruning a random network that is a factor O(log(dl)) wider and twice as deep.
Yeah, fair enough. I should amend the title of the question. Re: reinforcing the winning tickets: Isn’t that implied? If it’s not implied, would you not agree that it is happening? Plausibly, if there is a ticket at the beginning that does well at the task, and a ticket at the end that does well at the task, it’s reasonable to think that it’s the same ticket? Idk, I’m open to alternative suggestions now that you mention it...
I expect that there are probably a bunch of different neural networks that perform well at a given task. We sort of know this because you can train a dense neural network to high accuracy, and also prune it to get a definitely-different neural network that also has high accuracy. Is it the case that these sparse architectures are small enough that there’s only one optimum? Maybe, but IDK why I’d expect that.
Whoa, the thing you are arguing against is not at all what I had been saying—but maybe it was implied by what I was saying and I just didn’t realize it? I totally agree that there are many optima, not just one. Maybe we are talking past each other?
(Part of why I think the two tickets are the same is that the at-initialization ticket is found by taking the after-training ticket and rewinding it to the beginning! So for them not to be the same, the training process would need to kill the first ticket and then build a new ticket on exactly the same spot!)
I guess I’m imagining that ‘by default’, your distribution over which optimum SGD reaches should be basically uniform, and you need a convincing story to end up believing that it reliably gets to one specific optimum.
So for them not to be the same, the training process would need to kill the first ticket and then build a new ticket on exactly the same spot!
Yes, that’s exactly what I think happens. Training takes a long time, and I expect the weights in a ‘ticket’ to change based on the weights of the rest of the network (since those other weights have similar magnitude). I think the best way to see why I think that is to manually run thru the backpropagation algorithm.
If I’m wrong, it’s probably because of this paper that I don’t have time to read over right now (but that I do recommend you read).
Part of why I think the two tickets are the same is that the at-initialization ticket is found by taking the after-training ticket and rewinding it to the beginning!
This is true in the original LTH paper, but there the “at-initialization ticket” doesn’t actually perform well: it’s just easy to train to high performance.
In the multi-prize LTH paper, it is the case that the “at-initialization ticket” performs well, but they don’t find it by winding back the weights of a trained pruned network.
If you got multi-prize at-initialization tickets by winding back the weights of a trained pruned network, I would find that pretty convincing—the idea that they’d be totally different networks would seem like too much of a coincidence. But I would still want to actually check whether the weights were actually the same (which funnily enough isn’t trivial if you’re not familiar with a little-discussed symmetry of DNNs: for a hidden layer neuron with a ReLU activation function, you can scale the input weights up by a positive constant and the output weights down by the same constant without changing the functioning of the network).
OH this indeed changes everything (about what I had been thinking) thank you! I shall have to puzzle over these ideas some more then, and probably read the multi-prize paper more closely (I only skimmed it earlier)
Ah to be clear I am entirely basing my comments off of reading the abstracts (and skimming the multi-prize paper with an eye one develops after having been a ML PhD student for mumbles indistinctly years).
From what I can tell, the paper doesn’t demonstrate this—i.e. I don’t think they ever test the performance of a sub-network with random weights (rather they test the performance of a subnetwork after training only the subnetwork). Though maybe this isn’t what you meant, in which case you can ignore me :)
Yep, I agree that this question does not accurately describe the lottery ticket hypothesis.
The original paper doesn’t demonstrate this but later papers do, or at least claim to. Here are several papers with quotes:
https://arxiv.org/abs/2103.09377
“In this paper, we propose (and prove) a stronger Multi-Prize Lottery Ticket Hypothesis:
A sufficiently over-parameterized neural network with random weights contains several subnetworks (winning tickets) that (a) have comparable accuracy to a dense target network with learned weights (prize 1), (b) do not require any further training to achieve prize 1 (prize 2), and (c) is robust to extreme forms of quantization (i.e., binary weights and/or activation) (prize 3).”
https://arxiv.org/abs/2006.12156
″An even stronger conjecture has been proven recently: Every sufficiently overparameterized network contains a subnetwork that, at random initialization, but without training, achieves comparable accuracy to the trained large network.”
https://arxiv.org/abs/2006.07990
The strong {\it lottery ticket hypothesis} (LTH) postulates that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. A recent work by Malach et al. \cite{MalachEtAl20} establishes the first theoretical analysis for the strong LTH: one can provably approximate a neural network of width d and depth l, by pruning a random one that is a factor O(d4l2) wider and twice as deep. This polynomial over-parameterization requirement is at odds with recent experimental research that achieves good approximation with networks that are a small factor wider than the target. In this work, we close the gap and offer an exponential improvement to the over-parameterization requirement for the existence of lottery tickets. We show that any target network of width d and depth l can be approximated by pruning a random network that is a factor O(log(dl)) wider and twice as deep.
None of those quotes claim that training just reinforces the ‘winning tickets’. Also those are referred to as the “strong” or “multi-ticket” LTH.
*multi-prize
Oops
Yeah, fair enough. I should amend the title of the question. Re: reinforcing the winning tickets: Isn’t that implied? If it’s not implied, would you not agree that it is happening? Plausibly, if there is a ticket at the beginning that does well at the task, and a ticket at the end that does well at the task, it’s reasonable to think that it’s the same ticket? Idk, I’m open to alternative suggestions now that you mention it...
I don’t think it’s implied, and I’m not confident that it’s happening. There are lots of neural networks!
Hmmm, ok. Can you say more about why? Isn’t the simplest explanation that the two tickets are the same?
I expect that there are probably a bunch of different neural networks that perform well at a given task. We sort of know this because you can train a dense neural network to high accuracy, and also prune it to get a definitely-different neural network that also has high accuracy. Is it the case that these sparse architectures are small enough that there’s only one optimum? Maybe, but IDK why I’d expect that.
Whoa, the thing you are arguing against is not at all what I had been saying—but maybe it was implied by what I was saying and I just didn’t realize it? I totally agree that there are many optima, not just one. Maybe we are talking past each other?
(Part of why I think the two tickets are the same is that the at-initialization ticket is found by taking the after-training ticket and rewinding it to the beginning! So for them not to be the same, the training process would need to kill the first ticket and then build a new ticket on exactly the same spot!)
I guess I’m imagining that ‘by default’, your distribution over which optimum SGD reaches should be basically uniform, and you need a convincing story to end up believing that it reliably gets to one specific optimum.
Yes, that’s exactly what I think happens. Training takes a long time, and I expect the weights in a ‘ticket’ to change based on the weights of the rest of the network (since those other weights have similar magnitude). I think the best way to see why I think that is to manually run thru the backpropagation algorithm.
If I’m wrong, it’s probably because of this paper that I don’t have time to read over right now (but that I do recommend you read).
Oh here’s where I think things went wrong:
This is true in the original LTH paper, but there the “at-initialization ticket” doesn’t actually perform well: it’s just easy to train to high performance.
In the multi-prize LTH paper, it is the case that the “at-initialization ticket” performs well, but they don’t find it by winding back the weights of a trained pruned network.
If you got multi-prize at-initialization tickets by winding back the weights of a trained pruned network, I would find that pretty convincing—the idea that they’d be totally different networks would seem like too much of a coincidence. But I would still want to actually check whether the weights were actually the same (which funnily enough isn’t trivial if you’re not familiar with a little-discussed symmetry of DNNs: for a hidden layer neuron with a ReLU activation function, you can scale the input weights up by a positive constant and the output weights down by the same constant without changing the functioning of the network).
OH this indeed changes everything (about what I had been thinking) thank you! I shall have to puzzle over these ideas some more then, and probably read the multi-prize paper more closely (I only skimmed it earlier)
Ah to be clear I am entirely basing my comments off of reading the abstracts (and skimming the multi-prize paper with an eye one develops after having been a ML PhD student for mumbles indistinctly years).