Daniel Kokotajlo comments on Does the lottery ticket hypothesis suggest the scaling hypothesis?

Daniel Kokotajlo 22 Apr 2021 5:25 UTC
5 points
0
AF
The original paper doesn’t demonstrate this but later papers do, or at least claim to. Here are several papers with quotes:
https://arxiv.org/abs/2103.09377
“In this paper, we propose (and prove) a stronger Multi-Prize Lottery Ticket Hypothesis:
A sufficiently over-parameterized neural network with random weights contains several subnetworks (winning tickets) that (a) have comparable accuracy to a dense target network with learned weights (prize 1), (b) do not require any further training to achieve prize 1 (prize 2), and (c) is robust to extreme forms of quantization (i.e., binary weights and/or activation) (prize 3).”
https://arxiv.org/abs/2006.12156
″An even stronger conjecture has been proven recently: Every sufficiently overparameterized network contains a subnetwork that, at random initialization, but without training, achieves comparable accuracy to the trained large network.”
https://arxiv.org/abs/2006.07990
The strong {\it lottery ticket hypothesis} (LTH) postulates that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. A recent work by Malach et al. \cite{MalachEtAl20} establishes the first theoretical analysis for the strong LTH: one can provably approximate a neural network of width d and depth l, by pruning a random one that is a factor O(d4l2) wider and twice as deep. This polynomial over-parameterization requirement is at odds with recent experimental research that achieves good approximation with networks that are a small factor wider than the target. In this work, we close the gap and offer an exponential improvement to the over-parameterization requirement for the existence of lottery tickets. We show that any target network of width d and depth l can be approximated by pruning a random network that is a factor O(log(dl)) wider and twice as deep.
- DanielFilan 22 Apr 2021 16:59 UTC
  4 points
  0
  AF Parent
  None of those quotes claim that training just reinforces the ‘winning tickets’. Also those are referred to as the “strong” or “multi-ticket” LTH.
  - evhub 22 Apr 2021 18:48 UTC
    LW: 4 AF: 3
    0
    AF Parent
    *multi-prize
    - DanielFilan 23 Apr 2021 17:36 UTC
      LW: 2 AF: 1
      0
      AF Parent
      Oops
  - Daniel Kokotajlo 22 Apr 2021 19:11 UTC
    LW: 2 AF: 2
    0
    AF Parent
    Yeah, fair enough. I should amend the title of the question. Re: reinforcing the winning tickets: Isn’t that implied? If it’s not implied, would you not agree that it is happening? Plausibly, if there is a ticket at the beginning that does well at the task, and a ticket at the end that does well at the task, it’s reasonable to think that it’s the same ticket? Idk, I’m open to alternative suggestions now that you mention it...
    - DanielFilan 23 Apr 2021 17:36 UTC
      LW: 4 AF: 3
      0
      AF Parent
      
      Re: reinforcing the winning tickets: Isn’t that implied? If it’s not implied, would you not agree that it is happening?
      
      I don’t think it’s implied, and I’m not confident that it’s happening. There are lots of neural networks!
      - Daniel Kokotajlo 23 Apr 2021 21:01 UTC
        LW: 2 AF: 2
        0
        AF Parent
        Hmmm, ok. Can you say more about why? Isn’t the simplest explanation that the two tickets are the same?
        DanielFilan 23 Apr 2021 21:21 UTC
        LW: 2 AF: 1
        0
        AF Parent
        I expect that there are probably a bunch of different neural networks that perform well at a given task. We sort of know this because you can train a dense neural network to high accuracy, and also prune it to get a definitely-different neural network that also has high accuracy. Is it the case that these sparse architectures are small enough that there’s only one optimum? Maybe, but IDK why I’d expect that.
        Daniel Kokotajlo 24 Apr 2021 6:20 UTC
        LW: 2 AF: 2
        0
        AF Parent
        Whoa, the thing you are arguing against is not at all what I had been saying—but maybe it was implied by what I was saying and I just didn’t realize it? I totally agree that there are many optima, not just one. Maybe we are talking past each other?
        (Part of why I think the two tickets are the same is that the at-initialization ticket is found by taking the after-training ticket and rewinding it to the beginning! So for them not to be the same, the training process would need to kill the first ticket and then build a new ticket on exactly the same spot!)
        DanielFilan 24 Apr 2021 18:05 UTC
        LW: 4 AF: 3
        0
        AF Parent
        I guess I’m imagining that ‘by default’, your distribution over which optimum SGD reaches should be basically uniform, and you need a convincing story to end up believing that it reliably gets to one specific optimum.
        
        So for them not to be the same, the training process would need to kill the first ticket and then build a new ticket on exactly the same spot!
        
        Yes, that’s exactly what I think happens. Training takes a long time, and I expect the weights in a ‘ticket’ to change based on the weights of the rest of the network (since those other weights have similar magnitude). I think the best way to see why I think that is to manually run thru the backpropagation algorithm.
        
        If I’m wrong, it’s probably because of this paper that I don’t have time to read over right now (but that I do recommend you read).
        DanielFilan 24 Apr 2021 18:21 UTC
        LW: 10 AF: 7
        0
        AF Parent
        Oh here’s where I think things went wrong:
        
        Part of why I think the two tickets are the same is that the at-initialization ticket is found by taking the after-training ticket and rewinding it to the beginning!
        
        This is true in the original LTH paper, but there the “at-initialization ticket” doesn’t actually perform well: it’s just easy to train to high performance.
        
        In the multi-prize LTH paper, it is the case that the “at-initialization ticket” performs well, but they don’t find it by winding back the weights of a trained pruned network.
        
        If you got multi-prize at-initialization tickets by winding back the weights of a trained pruned network, I would find that pretty convincing—the idea that they’d be totally different networks would seem like too much of a coincidence. But I would still want to actually check whether the weights were actually the same (which funnily enough isn’t trivial if you’re not familiar with a little-discussed symmetry of DNNs: for a hidden layer neuron with a ReLU activation function, you can scale the input weights up by a positive constant and the output weights down by the same constant without changing the functioning of the network).
        What links here?
        Exploring the Lottery Ticket Hypothesis by Rauno Arike (25 Apr 2023 20:06 UTC; 58 points)
        Daniel Kokotajlo 25 Apr 2021 13:10 UTC
        LW: 2 AF: 2
        0
        AF Parent
        OH this indeed changes everything (about what I had been thinking) thank you! I shall have to puzzle over these ideas some more then, and probably read the multi-prize paper more closely (I only skimmed it earlier)
        DanielFilan 26 Apr 2021 22:26 UTC
        LW: 4 AF: 3
        0
        AF Parent
        Ah to be clear I am entirely basing my comments off of reading the abstracts (and skimming the multi-prize paper with an eye one develops after having been a ML PhD student for mumbles indistinctly years).