I’d like to add some points to this interesting discussion:
As far as I understand, feature learning is not necessary for some standard types of transfer learning. For example, one can train an NNGP on a large dataset and then use the learned posterior as the prior for “fine-tuning” on some new dataset. This is hard to scale using actual GP techniques, but if wide neural nets (with random sampling or SGD) do approximate NNGPs, this could be a way they achieve transfer learning without feature learning.
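To make the “posterior as prior” idea concrete, here is a minimal NumPy sketch rather than anything from the papers discussed. It assumes an order-1 arc-cosine kernel as a stand-in for the NNGP kernel of a one-hidden-layer ReLU network without biases (up to scaling), plus toy data and a fixed noise level; the helper names `arccos_kernel` and `condition` are hypothetical. It conditions a GP on a large “pretraining” set, reuses that posterior as the prior for a small “fine-tuning” set, and compares the result with conditioning on both sets at once.

```python
import numpy as np

def arccos_kernel(X1, X2):
    # Order-1 arc-cosine kernel (assumed stand-in for a ReLU NNGP kernel).
    n1 = np.linalg.norm(X1, axis=1)[:, None]
    n2 = np.linalg.norm(X2, axis=1)[None, :]
    cos = np.clip(X1 @ X2.T / (n1 * n2), -1.0, 1.0)
    theta = np.arccos(cos)
    return n1 * n2 * (np.sin(theta) + (np.pi - theta) * cos) / np.pi

def zero_mean(Z):
    return np.zeros(len(Z))

def condition(mean_fn, cov_fn, X, y, noise):
    # Gaussian conditioning: returns the posterior mean and covariance
    # functions of a GP with prior (mean_fn, cov_fn) after observing (X, y).
    K = cov_fn(X, X) + noise * np.eye(len(X))
    alpha = np.linalg.solve(K, y - mean_fn(X))

    def post_mean(Z):
        return mean_fn(Z) + cov_fn(Z, X) @ alpha

    def post_cov(Z1, Z2):
        return cov_fn(Z1, Z2) - cov_fn(Z1, X) @ np.linalg.solve(K, cov_fn(X, Z2))

    return post_mean, post_cov

# Toy data (assumption): a large "pretraining" set A and a small "fine-tuning" set B.
rng = np.random.default_rng(0)
X_a = rng.normal(size=(40, 3)); y_a = np.sin(X_a.sum(axis=1))
X_b = rng.normal(size=(5, 3));  y_b = np.sin(X_b.sum(axis=1)) + 0.3
X_test = rng.normal(size=(4, 3))
noise = 0.1

# Step 1: "pretrain" by conditioning the prior GP on A.
mean_a, cov_a = condition(zero_mean, arccos_kernel, X_a, y_a, noise)
# Step 2: "fine-tune" by using the posterior from A as the prior for B.
mean_ab, _ = condition(mean_a, cov_a, X_b, y_b, noise)
pred_seq = mean_ab(X_test)

# Reference: condition on A and B jointly in one step.
mean_joint, _ = condition(zero_mean, arccos_kernel,
                          np.vstack([X_a, X_b]), np.concatenate([y_a, y_b]), noise)
pred_joint = mean_joint(X_test)

print(np.max(np.abs(pred_seq - pred_joint)))
```

The two predictions agree up to numerical error, since sequential and joint Bayesian updates are the same thing here. The point is that all the transfer goes through a fixed kernel: nothing in the procedure adapts the features to dataset A.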
You say
In contrast, in the case of SGD, it’s possible to do feature learning even in the infinite-width limit
That is true, but one of the points in Greg Yang’s paper, as far as I remember, was also that people weren’t using the scaling limit that would lead to that. That has made me wonder whether feature learning is happening in our biggest models or not. The work on multimodal neurons in CLIP suggests there is feature learning. But what about GPT-3? In any case, I don’t think it would be happening by the mechanism Yang proposes, as people aren’t using his initialization scheme. Perhaps, then, the mechanism by which finite randomly-sampled NNs could conceivably feature-learn is the same as the one SGD is using. I am not sure either way. To evaluate the empirical evidence better, I’d need a sense of whether the evidence we have comes from sufficiently large models or not (as I do think that randomly-sampled NNs at infinite width won’t do feature learning, though I’m not sure how to prove that without a better definition of feature learning).
Another point is in answer to your comment that NNGP often underperforms NTK. I think there’s actually more evidence to the contrary (see https://arxiv.org/abs/2007.15801 ), even if there are instances going both ways.
Overall, I think the work in Jascha Sohl-Dickstein’s group (e.g. the paper linked above) has been great for disentangling these issues, and it seems to point to a complex, nuanced picture, which leads me to believe we don’t have a clear answer about whether NNGPs will be a good model of SGD in practice (as of today; practice may also change). However, my general observation is that I’m not aware of any evidence showing that SGD-trained nets beat architecture-equivalent NNGPs by a significant margin, consistently, over a wide range of tasks in practice. Chris’ work on the Bayesian picture of SGD tried to test this, but the problems are indeed not quite large enough to be confident. In https://arxiv.org/abs/2012.04115 we also explore NNGPs (but through a different lens) over SOTA architectures, though still on small tasks. So I think the question of how NNGPs would perform on more complex datasets remains open.