Perhaps this is a physicist vs mathematician type of thinking though
Good guess ;)
This is not the same as saying that an extremely wide trained-by-random-sampling neural network would not learn features. There is a possibility that the first time you reach 100% training accuracy corresponds to effectively randomly initialised initial layers plus a trained last layer, but in expectation all the layers should be distinct from an entirely random initialisation.
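To make this testable in the narrow case, here is a toy numpy sketch of the kind of check I have in mind: rejection-sample weights from the prior until the network fits a tiny training set, then compare the hidden-layer Gram matrix of the accepted samples against fresh prior draws. Everything here (the dataset, the widths, the acceptance criterion) is an arbitrary illustrative choice of mine, not taken from any paper.

```python
# Toy sketch of "training by random sampling" for a narrow MLP, to check
# whether the hidden layers of samples that fit the data differ from the prior.
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic binary-classification task (any small dataset would do).
X = rng.normal(size=(8, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def init(widths):
    # Gaussian prior over weights, scaled by 1/sqrt(fan_in).
    return [rng.normal(size=(m, n)) / np.sqrt(m) for m, n in zip(widths[:-1], widths[1:])]

def forward(ws, X):
    h = X
    for W in ws[:-1]:
        h = np.tanh(h @ W)          # hidden activations ("features")
    return h, (h @ ws[-1]).ravel()  # last hidden layer, scalar output

widths = [5, 10, 10, 1]             # a deliberately narrow network

# Rejection sampling: keep prior draws that reach 100% training accuracy.
accepted_hidden = []
while len(accepted_hidden) < 200:
    ws = init(widths)
    h, out = forward(ws, X)
    if np.all((out > 0).astype(int) == y):
        accepted_hidden.append(h)

# Prior baseline: hidden activations of unconditioned draws.
prior_hidden = [forward(init(widths), X)[0] for _ in range(200)]

# Crude comparison: mean feature Gram matrix under conditioned vs prior draws.
# (Nonzero even under the null due to finite sampling; a proper test would
# compare this against the sampling variability.)
post_K = np.mean([h @ h.T for h in accepted_hidden], axis=0)
prior_K = np.mean([h @ h.T for h in prior_hidden], axis=0)
print("||K_post - K_prior||_F =", np.linalg.norm(post_K - prior_K))
```

For a narrow network I would expect the two Gram matrices to differ by more than sampling noise; the interesting question is whether that gap shrinks to nothing as the width grows.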
I see—so you’re saying that even though the distribution of output functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of intermediate functions might change. If true, this would mean that feature learning and inductive bias are ‘uncoupled’ for infinite-width randomly-sampled nets. I think this is false, however—that is, I think it’s provable that the distribution of intermediate functions does not change in the infinite-width limit when you condition on the training data, even when conditioning over all layers. I can’t find a reference offhand though, I’ll report back if I find anything resolving this one way or another.
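Stated a bit more precisely (in my own notation, not from any reference), the claim is that for a depth-$L$ network of width $n$,

$$
p\big(h_\ell \mid \mathcal{D}\big) \longrightarrow p\big(h_\ell\big) \quad \text{as } n \to \infty, \quad \text{for every hidden layer } \ell < L,
$$

where $h_\ell$ is the (random) function computed up to layer $\ell$ and $\mathcal{D}$ is the training data; only the posterior over the output function changes.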
Haha some things are pretty obvious—it’s always really nice to get a very different perspective on an idea, thank you for continuing the conversation!
I see—so you’re saying that even though the distribution of output functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of intermediate functions might change. If true, this would mean that feature learning and inductive bias are ‘uncoupled’ for infinite-width randomly-sampled nets.
That is exactly what I’m saying. I don’t know if it is testable in practice, but it is in theory … I would be very interested to see anything about this—let me know if you find anything!
If it turns out that, in the limit of infinite width, feature learning does not work, what are your thoughts about my case for feature learning for the narrow (but trained-by-random-sampling) case? I would guess you find this case significantly more compelling than the infinite width case?
I just came across this paper, which derives an expression for the posterior distribution of the weights in each layer in the infinite-width limit. The result: the distribution is unchanged from the prior in every layer but the last. So it indeed seems that there is no feature learning in this limit.
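Schematically (in my own notation, not the paper’s), the stated result is

$$
p\big(W^{(\ell)} \mid \mathcal{D}\big) \to p\big(W^{(\ell)}\big) \quad \text{for } \ell = 1, \dots, L-1, \qquad p\big(W^{(L)} \mid \mathcal{D}\big) \neq p\big(W^{(L)}\big),
$$

i.e. conditioning on the training data $\mathcal{D}$ only updates the readout weights $W^{(L)}$, which is exactly the ‘random initial layers plus trained last layer’ picture from earlier in the thread.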