By hypothesis, all three methods will let us fit the target function. You seem to be saying (I think, correct me if I’m wrong) that all three methods should have the same inductive bias as well.
Not exactly the same: it is known that inductive bias depends on width. I believe that wider networks typically generalise better, although I know of some counterexamples.
They’re clearly different in some respects: (C) can do transfer learning but (A) cannot.
I think this is the main source of our disagreement. First of all, while the posterior of an NNGP is equivalent to that of a trained-by-random-sampling infinitely wide NN, it does not contain all the same information. It is a collapsed version of an infinitely wide neural network that does not contain any information about the weights in each layer. This was one of Greg Yang’s points: by definition, a kernel method cannot learn features, because it ignores the effects of the initial layers, which are irrelevant from a function perspective. In other words, you have just thrown that information away.
This is not the same as saying that an extremely wide trained-by-random-sampling neural network would not learn features. It is possible that the first time you reach 100% training accuracy corresponds to effectively randomly initialised initial layers plus a trained last layer, but in expectation all the layers should be distinct from an entirely random initialisation.
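For anyone following along, the sense in which the NNGP is a "collapsed" network can be written out. This is the standard kernel recursion for a fully connected net, not something taken from this thread; the symbols (nonlinearity φ, weight and bias variances σ_w², σ_b², input dimension d) are notation assumed here for illustration.

```latex
\begin{aligned}
K^{(0)}(x, x') &= \sigma_b^2 + \sigma_w^2 \, \frac{x^\top x'}{d}, \\
K^{(\ell+1)}(x, x') &= \sigma_b^2 + \sigma_w^2 \,
  \mathbb{E}_{(u, v) \sim \mathcal{N}(0, \Sigma^{(\ell)})}\!\left[ \phi(u)\, \phi(v) \right],
\qquad
\Sigma^{(\ell)} =
\begin{pmatrix}
K^{(\ell)}(x, x) & K^{(\ell)}(x, x') \\
K^{(\ell)}(x', x) & K^{(\ell)}(x', x')
\end{pmatrix}.
\end{aligned}
```

Each layer enters only through kernel values on pairs of inputs. No weight matrix appears anywhere in the recursion, which is the sense in which the per-layer weight information is absent from this representation.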
(B is unclear).
Assuming that the network is so compressed that it can barely represent the true function without substantial fine-tuning of weights in all layers, weights in early layers would absolutely have to be very different from random initialisation.
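A toy sketch of the narrow case, in case it helps. It assumes a width-1 tanh network with a standard normal prior on every weight, four 1D training points labelled by sign, and "training by random sampling" implemented as plain rejection sampling; all of these choices and every function name are illustrative, not anything established in this discussion.

```python
import math
import random

def sample_net(rng, width):
    """Draw f(x) = sum_j v[j] * tanh(w[j] * x + b[j]) from an N(0, 1) prior."""
    return {
        "w": [rng.gauss(0.0, 1.0) for _ in range(width)],
        "b": [rng.gauss(0.0, 1.0) for _ in range(width)],
        "v": [rng.gauss(0.0, 1.0) for _ in range(width)],
    }

def forward(net, x):
    return sum(v * math.tanh(w * x + b)
               for w, b, v in zip(net["w"], net["b"], net["v"]))

def fits(net, data):
    """True if the sign of the output matches every training label."""
    return all(forward(net, x) * y > 0 for x, y in data)

def mean_abs_first_layer(nets):
    """Average |w| over all first-layer weights of all given networks."""
    ws = [abs(w) for net in nets for w in net["w"]]
    return sum(ws) / len(ws)

def compare_prior_and_posterior(n_samples=20000, width=1, seed=0):
    rng = random.Random(seed)
    # Hypothetical training set: label is the sign of x.
    data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]
    prior = [sample_net(rng, width) for _ in range(n_samples)]
    # Rejection sampling: keep only the draws that fit the training data.
    posterior = [net for net in prior if fits(net, data)]
    return (mean_abs_first_layer(prior),
            mean_abs_first_layer(posterior),
            len(posterior))

prior_mean, post_mean, n_accepted = compare_prior_and_posterior()
print(prior_mean, post_mean, n_accepted)
```

In this setup a width-1 net can only fit the labels if |w| > |b|, so the accepted first-layer weights are shifted away from the prior. That is a statement about this toy problem only; it says nothing about what happens as the width grows.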
However, the way they do this is by taking a giant linear combination of random functions which is able to function identically to a car detector on the data points given. It seems like this might be more fragile/generalize worse than the neurons produced by SGD. Though that is admittedly somewhat conjectural at this stage, since we don’t really have a great understanding of how feature learning in SGD works.
You can make arguments that this is what would happen for very wide networks—but then SGD is probably doing the same thing, unless you’re assuming that it learns a few (e.g.) car detector neurons and then the rest are completely redundant. I would expect the car detector neurons to show up in narrower networks, but by my point immediately above, I don’t see why this has to be an SGD-only property.
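To make the "giant linear combination of random functions" picture concrete, here is a minimal sketch. It assumes random tanh features with a frozen first layer and a read-out trained by perceptron updates; the data, the feature count and the function names are all made up for illustration, and nothing here is meant as a model of what SGD does.

```python
import math
import random

def random_features(rng, n_features):
    """Fixed random first layer: (w, b) pairs that are never updated."""
    return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
            for _ in range(n_features)]

def featurize(first_layer, x):
    return [math.tanh(w * x + b) for w, b in first_layer]

def train_last_layer(first_layer, data, epochs=1000):
    """Perceptron updates on the read-out weights only."""
    v = [0.0] * len(first_layer)
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            phi = featurize(first_layer, x)
            if y * sum(vi * p for vi, p in zip(v, phi)) <= 0:
                v = [vi + y * p for vi, p in zip(v, phi)]
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return v

def predict(first_layer, v, x):
    score = sum(vi * p for vi, p in zip(v, featurize(first_layer, x)))
    return 1 if score > 0 else -1

rng = random.Random(0)
# Hypothetical training set: label is the sign of x.
data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]
first_layer = random_features(rng, 100)
frozen_copy = list(first_layer)
readout = train_last_layer(first_layer, data)
train_acc = sum(predict(first_layer, readout, x) == y for x, y in data) / len(data)
print(train_acc, first_layer == frozen_copy)
```

The fit comes entirely from the read-out weights; the random features themselves never move. Whether a predictor built this way generalises worse than learned features is the open question in the quoted comment, and this sketch does not settle it.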
My intuition here is that SGD-trained nets can learn functions non-linearly while NTK/GP can only do so linearly.
Yes, but again, an NNGP has thrown away all information about the weights. The NTK limit effectively passes all the gradient to the last layer, so again, by definition, it is a linear model.
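For reference, the usual way to state "linear model" in the NTK setting is the first-order expansion around the initial parameters. This is the standard textbook form rather than anything specific to this thread; f, θ₀ and Θ are notation assumed here.

```latex
f(x; \theta) \;\approx\; f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0),
\qquad
\Theta(x, x') = \nabla_\theta f(x; \theta_0)^\top \, \nabla_\theta f(x'; \theta_0).
```

The model is linear in θ, with fixed features given by the gradient at initialisation, so training reduces to kernel regression with Θ and the features themselves do not change.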
Since they are equivalent to NNGP/NTK at infinite width, any feature learning they do can only come from finiteness. In contrast, in the case of SGD, it’s possible to do feature learning even in the infinite-width limit.
Same point as above. The Greg Yang paper shows that you need a clever reparameterisation to make sure that not all of the gradient gets passed to the last layer (as it does in NTK). The NNGP flattens the neural network, so again there can be no feature learning in that representation. So I think the conclusion “can only come from finiteness” is wrong. The second point is correct, but only because you haven’t collapsed the network into a kernel. If you were to take an extremely wide neural network and train the whole thing by random sampling with some extra steps (e.g. encouraging orthogonality of intermediate outputs between different classes), I don’t see why you wouldn’t have some degree of ‘feature learning’ here.
Perhaps this is a physicist vs mathematician type of thinking though. I think I see where you are coming from, but I don’t think the no-feature-learning arguments are valid, for the reasons I outlined above.
Perhaps this is a physicist vs mathematician type of thinking though
Good guess ;)
This is not the same as saying that an extremely wide trained-by-random-sampling neural network would not learn features. It is possible that the first time you reach 100% training accuracy corresponds to effectively randomly initialised initial layers plus a trained last layer, but in expectation all the layers should be distinct from an entirely random initialisation.
I see. So you’re saying that even though the distribution of output functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of intermediate functions might change. If true, this would mean that feature learning and inductive bias are ‘uncoupled’ for infinite-width randomly-sampled nets. I think this is false, however. That is, I think it’s provable that the distribution of intermediate functions does not change in the infinite-width limit when you condition on the training data, even when conditioning over all layers. I can’t find a reference offhand, though; I’ll report back if I find anything resolving this one way or another.
Haha some things are pretty obvious—it’s always really nice to get a very different perspective on an idea, thank you for continuing the conversation!
I see. So you’re saying that even though the distribution of output functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of intermediate functions might change. If true, this would mean that feature learning and inductive bias are ‘uncoupled’ for infinite-width randomly-sampled nets.
That is exactly what I’m saying. I don’t know if it is testable in practice, but it is in theory … I would be very interested to see anything about this—let me know if you find anything!
If it turns out that, in the limit of infinite width, feature learning does not work, what are your thoughts about my case for feature learning for the narrow (but trained-by-random-sampling) case? I would guess you find this case significantly more compelling than the infinite width case?
I just came across this paper which derives an expression for the posterior distribution of the weights in each layer in the infinite-width limit. The result: the distribution is unchanged from the prior in every layer but the last. So it indeed seems that there is no feature learning in this limit.