The claim I am making is that the reason why feature learning is good is not because it improves inductive bias—it is because it allows the network to be compressed. That is probably at the core of our disagreement.
Yes, I think so. Let’s go over the ‘thin network’ example—we want to learn some function which can be represented by a thin network. But let’s say a randomly-initialized thin network’s intermediate functions won’t be able to fit the function—that is, (with high probability over the random initialization) we won’t be able to fit the function just by changing the parameters of the last layer. It seems there are a few ways we can alter the network to make fitting possible:
(A) Expand the network’s width until (with high probability) it’s possible to fit the function by only altering the last layer
(B) Keeping the width the same, re-sample the parameters in all layers until we find a setting that can fit the function
(C) Keeping the width the same, train the network with SGD
By hypothesis, all three methods will let us fit the target function. You seem to be saying [I think, correct me if I’m wrong] that all three methods should have the same inductive bias as well. I just don’t see any reason this should be the case—on the face of it, I would guess that all three have different inductive biases (though A and B might be similar). They’re clearly different in some respects -- (C) can do transfer learning but (A) cannot (B is unclear).
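To make the three options concrete, here is a minimal toy sketch (the 1-D target, widths, thresholds and step size are all my own illustrative choices, not anything from the papers under discussion): (A) is implemented as widening plus a least-squares fit of the readout on frozen random features, (B) is left as a comment because exact-fit re-sampling is hopeless for any interesting target, and (C) is plain full-batch gradient descent on every layer of the narrow net.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-1, 1, 20)[:, None]   # toy training inputs
y = np.sin(3 * X).ravel()             # a target a small tanh net can represent

def random_net(width):
    """Random one-hidden-layer tanh net: input weights, biases, readout weights."""
    W = rng.normal(size=(1, width))
    b = rng.normal(size=width)
    v = rng.normal(size=width) / np.sqrt(width)
    return W, b, v

# (A) widen until the *frozen* random features can fit the data via the last layer alone
width = 4
while True:
    W, b, _ = random_net(width)
    H = np.tanh(X @ W + b)
    v, *_ = np.linalg.lstsq(H, y, rcond=None)   # last-layer-only least-squares fit
    if np.max(np.abs(H @ v - y)) < 1e-2:
        break
    width *= 2

# (B) keep the width small and re-sample *all* parameters until they happen to fit:
# astronomically unlikely to ever terminate for an interesting target, which is the point.

# (C) keep the width small and train every layer with full-batch gradient descent
W, b, v = random_net(8)
lr, n = 0.1, len(y)
for _ in range(5000):
    H = np.tanh(X @ W + b)
    err = H @ v - y                          # residuals
    dZ = (err[:, None] * v) * (1 - H ** 2)   # gradient w.r.t. hidden pre-activations
    v -= lr * H.T @ err / n
    W -= lr * X.T @ dZ / n
    b -= lr * dZ.mean(axis=0)
```

Even in this toy setting the routes differ: (A) never moves the hidden layer, while (C) adjusts every layer.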
What do we know about SGD-trained nets that suggests this?
My intuition here is that SGD-trained nets can learn functions non-linearly while NTK/GP can only do so linearly. So in the car detector example, SGD is able to develop a neuron detecting cars through some as-yet unclear ‘feature learning’ mechanism. The NTK/GP can do so as well, sort of, since they’re universal function approximators. However, the way they do this is by taking a giant linear combination of random functions which is able to function identically to a car detector on the data points given. It seems like this might be more fragile/generalize worse than the neurons produced by SGD. Though that is admittedly somewhat conjectural at this stage, since we don’t really have a great understanding of how feature learning in SGD works.
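To spell out the ‘giant linear combination’ point: the mean NNGP/NTK prediction at a test input is a fixed linear combination of the training labels, f(x*) = K(x*, X) [K(X, X) + noise·I]^{-1} y. A tiny numpy sketch (with `K` standing in for whichever kernel, NNGP or NTK, and the noise term just an illustrative regulariser):

```python
import numpy as np

def kernel_predict(K, X_train, y_train, X_test, noise=1e-3):
    """Mean prediction of a kernel (NNGP/NTK-style) regressor."""
    # alpha depends only on the inputs and the kernel; the prediction is then linear in y_train
    alpha = np.linalg.solve(K(X_train, X_train) + noise * np.eye(len(X_train)), y_train)
    return K(X_test, X_train) @ alpha
```

Whatever ‘car detection’ such a model does lives entirely in those fixed kernel weights over the training set; no intermediate representation ever moves.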
I’ve read the new feature learning paper! We’re big fans of his work, although again I don’t think it contradicts anything I’ve just said.
ETA: Let me elaborate upon what I see as the significance of the ‘feature learning in infinite nets’ paper. We know that NNGP/NTK models can’t learn features, but SGD can: I think this provides strong evidence that they are learning using different mechanisms, and likely have substantially different inductive biases. The question is whether randomly sampled finite nets can learn features as well. Since they are equivalent to NNGP/NTK at infinite width, any feature learning they do can only come from finiteness. In contrast, in the case of SGD, it’s possible to do feature learning even in the infinite-width limit. This suggests that even if randomly-sampled finite nets can do feature learning, the mechanism by which they do so is different from SGD, and hence their inductive bias is likely to be different as well.
I’d like to add some points to this interesting discussion:
As far as I understand, feature learning is not necessary for some standard types of transfer learning. E.g.: one can train an NNGP on a large dataset, and then use the learned posterior as prior for “fine-tuning” on some new dataset. This is hard to scale using actual GP techniques, but if wide neural nets (with random sampling or SGD) do approximate NNGPs, this could be a way they achieve transfer learning without feature learning.
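Here is a minimal sketch of that posterior-as-prior scheme, using an RBF kernel as a stand-in for the NNGP kernel of some architecture (the datasets, lengthscale and noise level are all made up for illustration): condition once on the large source dataset, then treat the resulting posterior as the prior when conditioning on the small target dataset.

```python
import numpy as np

def rbf(A, B, ell=0.5):
    """RBF kernel as a stand-in for an NNGP kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def condition(mean_fn, cov_fn, X_obs, y_obs, noise=1e-2):
    """Return posterior mean/covariance functions after observing (X_obs, y_obs)."""
    K = cov_fn(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_inv = np.linalg.inv(K)
    resid = y_obs - mean_fn(X_obs)
    def post_mean(X):
        return mean_fn(X) + cov_fn(X, X_obs) @ K_inv @ resid
    def post_cov(A, B):
        return cov_fn(A, B) - cov_fn(A, X_obs) @ K_inv @ cov_fn(X_obs, B)
    return post_mean, post_cov

rng = np.random.default_rng(1)
X_big = rng.uniform(-2, 2, size=(50, 1))   # "large" source task
y_big = np.sin(X_big).ravel()
X_new = rng.uniform(-2, 2, size=(5, 1))    # small target task
y_new = np.sin(X_new).ravel()

prior_mean = lambda X: np.zeros(len(X))
m1, k1 = condition(prior_mean, rbf, X_big, y_big)   # "pre-train" on the source data
m2, k2 = condition(m1, k1, X_new, y_new)            # "fine-tune": posterior used as prior

print(m2(np.linspace(-2, 2, 5)[:, None]))           # predictions of the transferred model
```

Conditioning sequentially like this is equivalent to conditioning on the union of both datasets, so the ‘fine-tuned’ model benefits from the source data even though nothing resembling a learned feature is ever stored.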
You say
In contrast, in the case of SGD, it’s possible to do feature learning even in the infinite-width limit
That is true, but one of the points in Greg Yang’s paper, as far as I remember, was also that people weren’t using the scaling limit that would lead to that. That has made me wonder whether feature learning is happening in our biggest models or not. The work on multimodal neurons in CLIP suggests there is feature learning. But what about GPT-3? In any case, I don’t think it’ll be happening by the mechanism Yang proposes, as people aren’t using his initialization scheme. Perhaps, then, the mechanism by which finite randomly-sampled NNs could conceivably feature-learn is the same as the one SGD is using. I am not sure either way. For me to evaluate the empirical evidence better, I’d need a sense of whether the evidence we have comes from sufficiently large models or not (as I do think that randomly-sampled NNs at infinite width won’t do feature learning—though I’m not sure how to prove that without a better definition of feature learning).
Another point, in answer to your comment that the NNGP often underperforms the NTK: I think there’s actually more evidence to the contrary (see https://arxiv.org/abs/2007.15801), even if there are instances going both ways.
Overall, I think the work from Jascha Sohl-Dickstein’s group (e.g. the paper linked above) has been great for disentangling these issues, and it seems to point at a complex/nuanced picture, which really leads me to believe we don’t have a clear answer about whether NNGPs will be a good model of SGD in practice (as of today; practice may also change). However, my general observation is that I’m not aware of any evidence showing that SGD-trained nets beat architecture-equivalent NNGPs by a significant margin, consistently over a wide range of tasks in practice. Chris’ work on the Bayesian picture of SGD tried to do this, but the problems are indeed not quite large enough to be confident. Here https://arxiv.org/abs/2012.04115 we also explore NNGPs (but through a different lens), over SOTA architectures, though still on small tasks. So I think the question of how NNGPs would perform on more complex datasets remains open.
By hypothesis, all three methods will let us fit the target function. You seem to be saying [I think, correct me if I’m wrong] that all three methods should have the same inductive bias as well.
Not exactly the same—it is known that there is a width dependence on inductive biases. I believe that typically wide networks are better, although I know of some counterexamples.
They’re clearly different in some respects -- (C) can do transfer learning but (A) cannot
I think this is the main source of our disagreement. First of all, while the posterior of an NNGP is equivalent to that of a trained-by-random-sampling infinitely wide NN, it does not contain all the same information. It is a collapsed version of an infinitely wide neural network that does not contain any information about the weights in each layer. This was one of Greg Yang’s points—by definition, a kernel method cannot learn features as you are ignoring the effects of the initial layers, as from a function perspective they are irrelevant—in other words, you have just thrown that information away.
This is not the same as saying that an extremely wide trained-by-random-sampling neural network would not learn features—there is a possibility that the first time you reach 100% training accuracy corresponds to effectively randomly initialised initial layers + trained last layer, but in expectation all the layers should be distinct from an entirely random initialisation.
(B is unclear).
Assuming that the network is so compressed that it can barely represent the true function without substantial fine-tuning of weights in all layers, weights in early layers would absolutely have to be very different from random initialisation.
However, the way they do this is by taking a giant linear combination of random functions which is able to function identically to a car detector on the data points given. It seems like this might be more fragile/generalize worse than the neurons produced by SGD. Though that is admittedly somewhat conjectural at this stage, since we don’t really have a great understanding of how feature learning in SGD works.
You can make arguments that this is what would happen for very wide networks—but then SGD is probably doing the same thing, unless you’re assuming that it learns a few (e.g.) car detector neurons and then the rest are completely redundant. I would expect the car detector neurons to show up in narrower networks, but by my point immediately above, I don’t see why this has to be an SGD-only property.
My intuition here is that SGD-trained nets can learn functions non-linearly while NTK/GP can only do so linearly.
Yes but again an NNGP has thrown away all information about the weights. The NTK limit effectively passes all the gradient to the last layer, so again, by definition, it is a linear model.
Since they are equivalent to NNGP/NTK at infinite width, any feature learning they do can only come from finiteness. In contrast, in the case of SGD, it’s possible to do feature learning even in the infinite-width limit.
Same point as above. The Greg Yang paper shows you need to do the clever reparameterisation to make sure not all the gradient gets passed to the last layer (as it does in NTK). The NNGP flattens the neural network so again there can be no feature learning by that representation. So I think the conclusion “can only come from finiteness” is wrong. The second point is correct, but only because you haven’t collapsed the network into a kernel. If you were to take an extremely wide neural network and train the whole thing by random sampling with some extra steps (e.g. encouraging orthogonality of intermediate outputs between different classes), I don’t see why you wouldn’t have some degree of ‘feature learning’ here.
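As a crude toy version of what I mean by random sampling with extra steps (the data, width and selection rule below are all invented for illustration): sample many random hidden layers, keep the one whose class-mean activations are most nearly orthogonal, and only then fit the readout.

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(loc=-1.0, size=(8, 5))   # toy class-0 inputs
X1 = rng.normal(loc=+1.0, size=(8, 5))   # toy class-1 inputs
X = np.vstack([X0, X1])
y = np.array([-1.0] * 8 + [1.0] * 8)

def class_mean_cosine(H):
    """|cosine| between the mean hidden activations of the two classes."""
    h0, h1 = H[:8].mean(axis=0), H[8:].mean(axis=0)
    return abs(h0 @ h1) / (np.linalg.norm(h0) * np.linalg.norm(h1) + 1e-12)

# 'Extra step': among many randomly sampled hidden layers, keep the one whose class
# representations are closest to orthogonal, and only then fit the readout.
best_cos, best = np.inf, None
for _ in range(500):
    W1 = rng.normal(size=(5, 64)) / np.sqrt(5)
    H = np.maximum(X @ W1, 0.0)          # ReLU hidden activations
    c = class_mean_cosine(H)
    if c < best_cos:
        best_cos, best = c, (W1, H)

W1, H = best
W2, *_ = np.linalg.lstsq(H, y, rcond=None)   # readout fitted after selecting the features
```

The selected hidden layer is, by construction, no longer distributed like the unconditioned prior, which is the minimal sense of ‘feature learning’ I have in mind here.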
Perhaps this is a physicist vs mathematician type of thinking though. I think I see where you are coming from, but I don’t think the no feature learning arguments are valid, as I think I outlined.
Perhaps this is a physicist vs mathematician type of thinking though
Good guess ;)
This is not the same as saying that an extremely wide trained-by-random-sampling neural network would not learn features—there is a possibility that the first time you reach 100% training accuracy corresponds to effectively randomly initialised initial layers + trained last layer, but in expectation all the layers should be distinct from an entirely random initialisation.
I see—so you’re saying that even though the distribution of output functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of intermediate functions might change. If true, this would mean that feature learning and inductive bias are ‘uncoupled’ for infinite-width randomly-sampled nets. I think this is false, however—that is, I think it’s provable that the distribution of intermediate functions does not change in the infinite-width limit when you condition on the training data, even when conditioning over all layers. I can’t find a reference offhand, though; I’ll report back if I find anything resolving this one way or another.
Haha some things are pretty obvious—it’s always really nice to get a very different perspective on an idea, thank you for continuing the conversation!
I see—so you’re saying that even though the distribution of output functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of intermediate functions might change. If true, this would mean that feature learning and inductive bias are ‘uncoupled’ for infinite randomly-sampled nets
That is exactly what I’m saying. I don’t know if it is testable in practice, but it is in theory … I would be very interested to see anything about this—let me know if you find anything!
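One toy way I can imagine probing it at finite width (the sizes, data and acceptance rule below are arbitrary choices): rejection-sample small nets that happen to fit a tiny random training set, and compare a summary of their hidden activations with the same summary under the prior. If the infinite-width picture carries over, the two should match; any gap would be exactly the kind of sampling-induced feature learning we are debating.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))          # tiny toy training set
y = np.sign(rng.normal(size=4))      # random binary (+/-1) labels

def sample_net(width=16):
    W1 = rng.normal(size=(3, width)) / np.sqrt(3)
    W2 = rng.normal(size=width) / np.sqrt(width)
    return W1, W2

prior_summary, posterior_summary = [], []
while len(posterior_summary) < 200:
    W1, W2 = sample_net()
    H = np.maximum(X @ W1, 0.0)              # hidden activations on the training inputs
    prior_summary.append(H.mean())           # statistic under the unconditioned prior
    if np.all(np.sign(H @ W2) == y):         # keep only nets that classify the set correctly
        posterior_summary.append(H.mean())   # the same statistic, conditioned on fitting

print(np.mean(prior_summary), np.mean(posterior_summary))
```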
If it turns out that, in the limit of infinite width, feature learning does not work, what are your thoughts about my case for feature learning for the narrow (but trained-by-random-sampling) case? I would guess you find this case significantly more compelling than the infinite width case?
I just came across this paper which derives an expression for the posterior distribution of the weights in each layer in the infinite-width limit. The result: the distribution is unchanged from the prior in every layer but the last. So it indeed seems that there is no feature learning in this limit.