The problem is what we mean by e.g. “dog-recognizing subcircuit”. The simplest version would be something like “at initialization, there’s already one neuron which lights up in response to dogs”. (And that’s basically the version which would be needed in order for a gradient descent process to actually pick out that lottery ticket.) That’s the version which I’d call implausible: function space is superexponentially large, circuit space is smaller but still superexponential, so no neural network is ever going to be large enough to have neurons which light up to match most functions/circuits. I would argue that dog-detectors are a lot more special than random circuits even a priori, but not so much more special that the space-of-functions-that-special is less than exponentially large. (For very small circuits like edge detectors, it’s more plausible that some neurons implement that function right from the start.)
The thing in the paper you linked is doing something different from that. At initialization, the neurons in the subcircuits they’re finding would not light up in recognition of a dog, because they’re still connected to a bunch of other stuff that’s not in the subcircuit—the subcircuit only detects dogs once the other stuff is disconnected. And, IIUC, SGD should not reliably “find” those tickets: because no neurons in the subcircuit are significantly correlated with dogs, SGD doesn’t have any reason to upweight them for dog-recognition. So what’s going on in that paper is different from what’s going on in normal SGD-trained nets (or at least it’s not the full story).
I was in a discussion yesterday that made it seem pretty plausible that you’re wrong—this paper suggests that the over-parameterization needed to ensure that some circuit is (approximately) present at the beginning is not that large.
function space is superexponentially large, circuit space is smaller but still superexponential, so no neural network is ever going to be large enough to have neurons which light up to match most functions/circuits.

I haven’t actually read the paper I’m referencing, but my understanding is that this argument doesn’t work out, because the number of possible circuits of size N is balanced by the high number of subgraphs in a graph of size M (where M is only logarithmically larger than N).
That being said, I don’t know whether “present at the beginning” is the same as “easily found by gradient descent”.
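A back-of-envelope version of that counting argument (my own sketch, not taken from the paper; the constants and the circuit-counting bound are illustrative assumptions, and this gives only a necessary condition, not a guarantee):

```python
import math

# A pruned subnetwork is a subset of the dense net's weights, so a net with
# E weights has 2**E candidate subnetworks, i.e. log2(#subnetworks) = E.
def log2_subnetworks(num_weights: int) -> float:
    return float(num_weights)

# Crude upper bound on distinct circuits built from n gates with fan-in 2:
# each gate picks a gate type and two input wires, so log2(#circuits) is O(n log n).
def log2_circuits(n_gates: int, gate_types: int = 16) -> float:
    return n_gates * math.log2(gate_types * n_gates * n_gates)

n = 10_000                              # size of the target circuit
e = 40 * n * math.ceil(math.log2(n))    # weights in a modestly larger dense net
# Counting condition: there are vastly more subnetworks than circuits, so a
# matching subnetwork is at least not ruled out by a counting argument.
print(log2_subnetworks(e) > log2_circuits(n))  # True
```

The point of the sketch is just that the subnetwork count grows like 2^E while the circuit count grows like 2^(N log N), so E (and hence the network size) only needs to exceed N by a modest factor for counting purposes.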
See my comment here. The paper you link (like others I’ve seen in this vein) requires pruning, which will change the functional behavior of the nodes themselves. As I currently understand it, it is consistent with not having any neuron which lights up to match most functions/circuits.
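To illustrate the point with a toy example (my own construction, not from the paper): a neuron’s pre-activation under the full weight vector generally differs from its pre-activation after pruning, so the surviving subcircuit only computes its “ticket” function once the other connections are actually cut:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)     # inputs feeding into one neuron
w = rng.normal(size=100)     # the neuron's incoming weights
mask = np.zeros(100)
mask[:10] = 1.0              # keep a 10-weight "subcircuit", prune the rest

full = x @ w                 # the neuron's pre-activation in the dense net
pruned = x @ (w * mask)      # the same neuron after pruning
# The 90 pruned connections almost surely contribute, so the two differ:
print(abs(full - pruned) > 1e-9)  # True
```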
Ah, I should have read comments more carefully.
I agree with your comment that the claim “I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization” is ambiguous—“contains” can mean “under pruning”.
Obviously, this is an important distinction: if the would-be dog-recognizing circuit behaves extremely differently due to intersection with a lot of other circuits, it could be much harder to find. But why is “a single neuron lighting up” where you draw the line?
It seems clear that at least some relaxation of that requirement is tenable. For example, if no one neuron lights up in the correct pattern, but there’s a linear combination of neurons (before the output layer) which does, then it seems we’re good to go: GD could find that pretty easily.
I guess this is where the tangent space model comes in; if in practice (for large networks) we stay in the tangent space, then a linear combination of neurons is just about as far as we can relax your requirement.
But without the tangent-space hypothesis, it’s unclear where to draw the line, and your claim that an existing neuron already behaving in the desired way is “what would be necessary for the lottery ticket intuition” isn’t clear to me. (Is there a more obvious argument for this, according to you?)
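The “linear combination of neurons” case can be illustrated directly (a toy sketch of my own; the sizes, seed, and learning rate are arbitrary): freeze a random hidden layer, plant a target that is by construction a linear combination of its units, and check that plain gradient descent on a linear readout recovers it:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(50, 20)) / np.sqrt(20)  # frozen random hidden layer
X = rng.normal(size=(2000, 20))
H = np.maximum(0.0, X @ W.T)                 # hidden activations at "initialization"

# By construction there is already a feature: a linear combination of
# hidden units that computes the label exactly.
v_true = rng.normal(size=50)
y = (H @ v_true > 0).astype(float)

# Plain gradient descent on a linear readout with logistic loss.
v = np.zeros(50)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(H @ v)))
    v -= 0.5 * H.T @ (p - y) / len(y)

# Accuracy should be high, since GD only has to find an existing feature.
acc = np.mean((H @ v > 0) == (y > 0.5))
print(acc)
```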
Yeah, I agree that something more general than one neuron but less general than (or at least different from) pruning might be appropriate. I’m not particularly worried about where that line “should” be drawn a priori, because the tangent space indeed seems like the right place to draw the line empirically.
Wait… so:
1. The tangent-space hypothesis implies something close to “GD finds a solution if and only if there’s already a dog-detecting neuron” (for large networks, that is) -- specifically, it seems to imply something pretty close to “there’s already a feature”, where “feature” means a linear combination of existing neurons within a single layer.
2. GD in fact trains NNs to recognize dogs.
3. Therefore, we’re still in the territory of “there’s already a dog detector”.
...yeah?
The tangent-space hypothesis implies something close to this but not quite—instead of ‘dog-detecting neuron’, it’s ‘parameter such that the partial derivative of the output with respect to that parameter, as a function of the input, implements a dog-detector’. This would include (the partial derivative w.r.t.) neurons via their bias.
Not quite. The linear expansion isn’t just over the parameters associated with one layer, it’s over all the parameters in the whole net.
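Concretely (a finite-difference sketch I put together; the tiny network and seed are arbitrary): for each scalar parameter anywhere in the net, the tangent feature is the map x -> df/dtheta evaluated at initialization. For a last-layer weight this is exactly the corresponding neuron’s activation, and for a hidden bias it is the neuron’s firing indicator scaled by its outgoing weight:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1, w2 = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=4)

def f(x, W1, b1, w2):
    """Tiny two-layer ReLU net with scalar output."""
    return w2 @ np.maximum(0.0, W1 @ x + b1)

def tangent_feature(x, name, idx, eps=1e-5):
    """df/dtheta for one scalar parameter, by central finite differences."""
    params = {"W1": W1.copy(), "b1": b1.copy(), "w2": w2.copy()}
    params[name].flat[idx] += eps
    hi = f(x, params["W1"], params["b1"], params["w2"])
    params[name].flat[idx] -= 2 * eps
    lo = f(x, params["W1"], params["b1"], params["w2"])
    return (hi - lo) / (2 * eps)

x = rng.normal(size=3)
h = np.maximum(0.0, W1 @ x + b1)   # hidden activations at this input
# Tangent feature of the first output weight = neuron 0's activation:
print(np.isclose(tangent_feature(x, "w2", 0), h[0]))                # True
# Tangent feature of neuron 0's bias = w2[0] * (is neuron 0 active?):
print(np.isclose(tangent_feature(x, "b1", 0), w2[0] * (h[0] > 0)))  # True
```

Note that the expansion runs over every parameter in the net (both layers here), not just the parameters of one layer.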
One important thing to note here is that the LTH paper doesn’t demonstrate that SGD “finds” a ticket: just that the subnetwork you get by training and pruning could be trained in isolation to higher accuracy. That doesn’t mean the subnetwork ends up with the same weights when trained in isolation as it had in the original training!
So in particular I basically disagree with the opening summary of the content of the “lottery ticket hypothesis”. I think a better summary is found in the abstract of the original paper:

dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations
Yup, I agree that that quote says something which is probably true, given current evidence. I don’t think “picking a winning lottery ticket” is a good analogy for what that implies, though; see this comment.
Yup, I agree that that quote says something which is probably true, given current evidence.

I don’t know what the referent of ‘that quote’ is. If you mean the passage I quoted from the original lottery ticket hypothesis (“LTH”) paper, then I highly recommend reading a follow-up paper which describes how and why it’s wrong for large networks. The abstract of the paper I’m citing here:

We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. We use this technique to study iterative magnitude pruning (IMP), the procedure used by work on the lottery ticket hypothesis to identify subnetworks that could have trained in isolation to full accuracy. We find that these subnetworks only reach full accuracy when they are stable to SGD noise, which either occurs at initialization for small-scale settings (MNIST) or early in training for large-scale settings (ResNet-50 and Inception-v3 on ImageNet).
I don’t think “picking a winning lottery ticket” is a good analogy for what that implies

Again, assuming “that” refers to the claim in the original LTH paper, I also don’t think it’s a good analogy. But by default I think that claim is what “the lottery ticket hypothesis” refers to, given that it’s a widely cited paper that has spawned a large number of follow-up works.
Oh that’s definitely interesting. Thanks for the link.