IIUC, here’s a simple way to test this hypothesis: initialize a random neural network, and then find the minimum-loss point in the tangent space. Since the tangent space is linear, this is easy to do (i.e. it doesn’t require heuristic gradient descent): for square loss it’s just solving a large linear system once; for many other losses it should amount to convex optimization, for which we have provably efficient algorithms. And I’d guess the problem is underdetermined, so you add some regularization. Is the result about as good as normal gradient descent in the actual parameter space? I’m guessing some of the linked papers might have done something like this?
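For concreteness, here is a minimal sketch of that test in JAX, not taken from any of the linked papers: linearize a randomly initialized MLP around its initial parameters, then solve the resulting ridge-regularized least-squares problem in closed form (dual/kernel form, since the parameter count exceeds the number of data points). The architecture, ridge coefficient, and toy data are all illustrative placeholders.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_mlp(key, sizes):
    """Random MLP parameters: list of (W, b) pairs."""
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in),
                       jnp.zeros(d_out)))
    return params

def mlp(params, x):
    """Forward pass; returns one scalar output per input row."""
    for W, b in params[:-1]:
        x = jax.nn.relu(x @ W + b)
    W, b = params[-1]
    return (x @ W + b).squeeze(-1)

params0 = init_mlp(jax.random.PRNGKey(0), [10, 256, 256, 1])
theta0, unravel = ravel_pytree(params0)

def features(x):
    """Tangent-space feature map: phi(x) = d f(x; theta) / d theta at theta0."""
    return jax.grad(lambda th: mlp(unravel(th), x[None, :])[0])(theta0)

# Toy regression data (placeholder for a real dataset).
kx, ky = jax.random.split(jax.random.PRNGKey(1))
X = jax.random.normal(kx, (200, 10))
y = jnp.sin(X[:, 0]) + 0.1 * jax.random.normal(ky, (200,))

Phi = jax.vmap(features)(X)            # (n, p) Jacobian features
residual = y - mlp(params0, X)         # linear part fits y - f(x; theta0)

# Underdetermined (p >> n), so add ridge regularization and solve the
# dual system: alpha = (Phi Phi^T + lam I)^{-1} residual.
lam = 1e-3
K = Phi @ Phi.T                        # empirical NTK Gram matrix
alpha = jnp.linalg.solve(K + lam * jnp.eye(K.shape[0]), residual)
delta_theta = Phi.T @ alpha            # min-norm tangent-space solution

def f_linearized(x):
    return mlp(params0, x) + jax.vmap(features)(x) @ delta_theta

print("train MSE of linearized fit:", jnp.mean((f_linearized(X) - y) ** 2))
```

The comparison the comment asks about would then be between `f_linearized` and the same architecture trained by ordinary gradient descent on the same data.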
Yup, people have done this (taking the infinite-width limit at the same time): see here and here. Generally the kernels do worse than the original networks, but not by a lot. On the other hand, they’re usually applied to problems that aren’t super hard, where non-neural-net classifiers already worked pretty well. And these models definitely can’t explain feature learning, since the functions computed by individual neurons don’t change at all during training.
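As a hedged sketch of what the infinite-width version looks like in practice (assuming the `neural_tangents` stax API; the architecture, widths, and toy data below are placeholders, not the setups used in the linked papers), one computes the exact infinite-width NTK and does kernel ridge regression with it:

```python
import jax.numpy as jnp
from jax import random
from neural_tangents import stax

# Infinite-width analogue of a 2-hidden-layer ReLU MLP.
_, _, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1))

x_train = random.normal(random.PRNGKey(0), (200, 10))
y_train = jnp.sin(x_train[:, :1])
x_test = random.normal(random.PRNGKey(1), (50, 10))

# Exact NTK Gram matrices in the infinite-width limit.
k_tt = kernel_fn(x_train, x_train, 'ntk')
k_st = kernel_fn(x_test, x_train, 'ntk')

# Kernel ridge regression: the infinite-width counterpart of training the
# linearized network to convergence on square loss, with a small
# diagonal regularizer for numerical stability.
reg = 1e-4
y_pred = k_st @ jnp.linalg.solve(k_tt + reg * jnp.eye(k_tt.shape[0]), y_train)
```

In this limit the kernel is fixed at initialization, which is the sense in which individual neurons’ functions never change during training.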
This basically matches my current understanding. (Though I’m not strongly confident in it.) I believe the GP results are basically equivalent to this, but I haven’t read up on the topic enough to be sure.