I expect that there are probably a bunch of different neural networks that perform well at a given task. We sort of know this because you can train a dense neural network to high accuracy, and also prune it to get a definitely-different neural network that also has high accuracy. Is it the case that these sparse architectures are small enough that there’s only one optimum? Maybe, but IDK why I’d expect that.
Whoa, the thing you are arguing against is not at all what I had been saying—but maybe it was implied by what I was saying and I just didn’t realize it? I totally agree that there are many optima, not just one. Maybe we are talking past each other?
(Part of why I think the two tickets are the same is that the at-initialization ticket is found by taking the after-training ticket and rewinding it to the beginning! So for them not to be the same, the training process would need to kill the first ticket and then build a new ticket on exactly the same spot!)
I don’t think it’s implied, and I’m not confident that it’s happening. There are lots of neural networks!
Hmmm, ok. Can you say more about why? Isn’t the simplest explanation that the two tickets are the same?
I guess I’m imagining that ‘by default’, your distribution over which optimum SGD reaches should be basically uniform, and you need a convincing story to end up believing that it reliably gets to one specific optimum.
So for them not to be the same, the training process would need to kill the first ticket and then build a new ticket on exactly the same spot!
Yes, that’s exactly what I think happens. Training takes a long time, and I expect the weights in a ‘ticket’ to change based on the weights of the rest of the network (since those other weights have similar magnitude). I think the best way to see why is to manually run through the backpropagation algorithm.
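Here’s a minimal sketch of that exercise, assuming a toy two-layer ReLU net in NumPy (the sizes, data, and which weights count as the ‘ticket’ are arbitrary choices of mine, not anything from the papers). The point is just that the gradient on a first-layer ‘ticket’ weight carries a factor of the second-layer weights, so the ticket’s training trajectory depends on the rest of the network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer ReLU network: y = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(4, 3))   # pretend some entries of W1 are "ticket" weights
W2 = rng.normal(size=(1, 4))   # ...and the rest of the network lives in W2
x = rng.normal(size=(3,))
target = 1.0

# Forward pass
z = W1 @ x                     # hidden pre-activations
h = np.maximum(z, 0.0)         # ReLU
y = W2 @ h
loss = 0.5 * (y[0] - target) ** 2

# Manual backward pass
dy = y[0] - target             # dL/dy
dh = W2[0] * dy                # dL/dh already contains the OTHER layer's weights
dz = dh * (z > 0)              # gate by which neurons fired
dW1 = np.outer(dz, x)          # dL/dW1[i, j] = (y - target) * W2[0, i] * 1[z_i > 0] * x_j

# So the update to any "ticket" weight W1[i, j] is scaled by the non-ticket
# weight W2[0, i], and gated by pre-activations that depend on all of W1.
print(dW1)
```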
If I’m wrong, it’s probably because of this paper that I don’t have time to read over right now (but that I do recommend you read).
Part of why I think the two tickets are the same is that the at-initialization ticket is found by taking the after-training ticket and rewinding it to the beginning!
Oh here’s where I think things went wrong:
This is true in the original LTH paper, but there the “at-initialization ticket” doesn’t actually perform well: it’s just easy to train to high performance.
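For reference, this is the rewind-and-retrain loop I mean, as a schematic sketch only: a toy linear-regression stand-in for the real thing, with a pruning rate and step counts I made up, not the paper’s actual deep-net setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in task: sparse linear regression instead of a deep net.
X = rng.normal(size=(256, 20))
w_target = rng.normal(size=20) * (rng.random(20) < 0.3)  # sparse ground truth
y = X @ w_target

def train(w, mask, steps=500, lr=0.01):
    """Gradient descent on the masked weights; the mask stays fixed."""
    w = w.copy()
    for _ in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / len(X)
        w -= lr * grad * mask
    return w

w_init = rng.normal(size=20)   # theta_0, saved so we can rewind to it
mask = np.ones(20)

# Iterative magnitude pruning with rewinding, schematically:
for _ in range(3):
    w_trained = train(w_init, mask)
    # Prune the smallest-magnitude surviving weights (20% per round here)...
    alive = np.flatnonzero(mask)
    k = max(1, len(alive) // 5)
    prune = alive[np.argsort(np.abs(w_trained[alive]))[:k]]
    mask[prune] = 0.0
    # ...and "rewind": the next round retrains from w_init, not w_trained.

# The "at-initialization ticket" is (w_init, mask). In the original paper
# its claim to fame is trainability, not performance before training.
print("surviving ticket:", np.flatnonzero(mask))
```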
In the multi-prize LTH paper, it is the case that the “at-initialization ticket” performs well, but they don’t find it by winding back the weights of a trained pruned network.
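And my loose picture of how they do find it, again based only on a skim: something in the edge-popup family (Ramanujan et al.’s “What’s Hidden in a Randomly Weighted Neural Network?”), where the weights stay frozen at their random initialization and you instead learn a score per weight, keeping the top-k. Everything specific below, the linear stand-in and the straight-through-style score update, is my assumption for illustration, not the multi-prize paper’s actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

w_rand = rng.normal(size=20)                 # frozen random weights, never trained
X = rng.normal(size=(256, 20))
y = X @ (w_rand * (rng.random(20) < 0.5))    # a target some sub-network can hit

def topk_mask(scores, k):
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0      # keep the k highest-scoring weights
    return mask

scores = 0.01 * rng.normal(size=20)
k = 10

for _ in range(500):
    mask = topk_mask(scores, k)
    err = X @ (w_rand * mask) - y
    grad_w = X.T @ err / len(X)              # gradient the loss sends to w * mask
    # Straight-through-style update: nudge each score as if its weight were
    # present, using d(loss)/d(mask_j) = grad_w_j * w_rand_j.
    scores -= 0.1 * grad_w * w_rand

print("at-initialization ticket:", np.flatnonzero(topk_mask(scores, k)))
```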
If you got multi-prize at-initialization tickets by winding back the weights of a trained pruned network, I would find that pretty convincing: the idea that they’d be totally different networks would seem like too much of a coincidence. But I would still want to check whether the weights were actually the same, which funnily enough isn’t trivial, thanks to a little-discussed symmetry of DNNs: for a hidden-layer neuron with a ReLU activation function, you can scale its input weights up by a positive constant and its output weights down by the same constant without changing the function the network computes.
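That symmetry is easy to check numerically. A quick sketch, where the layer sizes, the constant, and which neuron gets rescaled are all arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
W_in = rng.normal(size=(5, 3))    # input weights of the hidden layer
W_out = rng.normal(size=(2, 5))   # output weights of the hidden layer
x = rng.normal(size=(3,))

def net(W_in, W_out, x):
    return W_out @ np.maximum(W_in @ x, 0.0)

# Scale neuron i's input weights up by c > 0 and its output weights down
# by c. Since ReLU is positively homogeneous (relu(c * z) = c * relu(z)
# for c > 0), the network computes exactly the same function.
c, i = 3.7, 2
W_in2, W_out2 = W_in.copy(), W_out.copy()
W_in2[i, :] *= c
W_out2[:, i] /= c

print(np.allclose(net(W_in, W_out, x), net(W_in2, W_out2, x)))  # True
```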
OH this indeed changes everything (about what I had been thinking) thank you! I shall have to puzzle over these ideas some more then, and probably read the multi-prize paper more closely (I only skimmed it earlier)
Ah to be clear I am entirely basing my comments off of reading the abstracts (and skimming the multi-prize paper with an eye one develops after having been an ML PhD student for mumbles indistinctly years).