I’ll confess that I would personally find it kind of disappointing if neural nets were mostly just an efficient way to implement some fixed kernels, when it seems possible that they could be doing something much more interesting—perhaps even implementing something like a simplicity prior over a large class of functions, which I’m pretty sure NTK/GP can’t be
Wait, why can’t NTK/GP be implementing a simplicity prior over a large class of functions? They totally are, it’s just that the prior comes from the measure in random initialization space, rather than from the gradient update process. As explained here. Right? No?
There’s an important distinction[1] to be made between these two claims:
(A) Every function with large volume in parameter space is simple
(B) Every simple function has large volume in parameter space
For a method of inference to qualify as a ‘simplicity prior’, you want both claims to hold. This is what lets us derive bounds like ‘Solomonoff induction matches the performance of any computable predictor’, since all of the simple, computable predictors have relatively large volume in the Solomonoff measure, so they’ll be picked out after boundedly many mistakes. In particular, you want there to be an implication like, if a function has complexity C, it will have parameter-volume at least exp(−βC).
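To make the two directions concrete (my notation, not anything from the papers under discussion): writing V(f) for the parameter-volume of a function f and C(f) for its complexity, the two claims roughly amount to

$$\text{(A)}\quad C(f) \,\le\, -\log V(f) + O(1) \qquad\qquad \text{(B)}\quad V(f) \,\ge\, \exp(-\beta\, C(f))$$

(A) is the Levin-bound direction: anything occupying a large chunk of volume must be simple. (B) is the direction a genuine simplicity prior needs, because a Bayesian predictor whose prior weight on f is proportional to V(f) has cumulative log-loss at most −log V(f) ≤ βC(f) + const when f is the truth, which is the Solomonoff-style guarantee.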
Now, the Mingard results, at least the ones that have mathematical proof, rely on the Levin bound. This only shows (A), which is the direction that is much easier to prove—it automatically holds for any mapping from parameter space to functions that has bounded complexity. They also have some empirical results that show there is substantial ‘clustering’, that is, there are some simple functions that have large volumes. But this still doesn’t show that all of them do, and indeed it is compatible with the learnable function class being extremely limited. For instance, this could easily be the case even if the NTK/GP were only able to learn linear functions.

In reality the NTK/GP is capable of approximating arbitrary functions on finite-dimensional inputs, but, as I argued in another comment, this is not the right notion of ‘universality’ for classification problems. I strongly suspect[2] that the NTK/GP can be shown not to be ‘universally data-efficient’ as I outlined there, but as far as I’m aware no one has looked into the issue formally yet. Empirically, I think the results we have so far suggest that the NTK/GP is a decent first-order approximation on simple tasks, but one that tends to perform worse on the more difficult problems that require non-trivial feature learning/efficiency.
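For a sense of what the empirical ‘clustering’ results are measuring, here is a minimal toy sketch (my own reconstruction, not the actual code from those papers): sample a small MLP at random initialization many times, record which Boolean function it computes on a fixed input set, and treat each function’s sampling frequency as an estimate of its parameter-volume.

```python
import numpy as np
from collections import Counter
from itertools import product

rng = np.random.default_rng(0)

# All 2^7 binary inputs of a toy Boolean problem.
X = np.array(list(product([0.0, 1.0], repeat=7)))

def random_mlp_function(width=40):
    """Sample a one-hidden-layer ReLU net at i.i.d. Gaussian init and return
    the Boolean function it computes on X, encoded as a byte string."""
    W1 = rng.normal(0, 1 / np.sqrt(X.shape[1]), size=(X.shape[1], width))
    b1 = rng.normal(0, 0.1, size=width)
    W2 = rng.normal(0, 1 / np.sqrt(width), size=(width, 1))
    h = np.maximum(X @ W1 + b1, 0.0)   # hidden features
    y = (h @ W2).ravel() > 0.0         # threshold the scalar output
    return y.tobytes()

# Each function's sampling frequency is a Monte Carlo estimate of its volume.
counts = Counter(random_mlp_function() for _ in range(20000))
volumes = np.array(sorted(counts.values(), reverse=True)) / 20000
print(f"{len(counts)} distinct functions seen; top-5 volume estimates: {volumes[:5]}")
```

A heavy-tailed histogram here (a few functions soaking up most of the volume, and those functions tending to be simple) is the observed clustering; the point above is that this alone doesn’t tell you whether every simple function gets volume at least exp(−βC).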
I actually posted basically the same thing underneath another one of your comments a few weeks ago, but maybe you didn’t see it because it was only posted on LW, not the Alignment Forum.
Basically, because in the NTK/GP limit the functions computed by all the neurons in a given layer are sampled from a single computable distribution, I think you can show that the embedding is ‘effectively finite’ in some sense (although note it is a universal approximator for fixed input dimension).
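A quick numerical illustration of the ‘single distribution’ point (a sketch of the standard infinite-width behaviour, not of the ‘effectively finite’ argument itself): at random initialization every hidden unit is an i.i.d. draw from the same distribution, so the feature-space inner product between two fixed inputs is a sample mean that concentrates on a deterministic kernel value as width grows; the features are fixed by the initialization distribution rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=10), rng.normal(size=10)

def empirical_kernel(width):
    """One-hidden-layer ReLU net at random init: average the feature
    products of `width` i.i.d. hidden units for the two fixed inputs."""
    W = rng.normal(0, 1 / np.sqrt(len(x1)), size=(width, len(x1)))
    h1, h2 = np.maximum(W @ x1, 0.0), np.maximum(W @ x2, 0.0)
    return (h1 @ h2) / width

for width in [100, 1_000, 10_000, 100_000]:
    draws = [empirical_kernel(width) for _ in range(20)]
    print(f"width {width:>7}: kernel ≈ {np.mean(draws):.4f} "
          f"(std across inits {np.std(draws):.4f})")
```

The spread across initializations shrinks like 1/√width, which is the sense in which the infinite-width NTK/GP embedding is pinned down by a fixed kernel rather than by anything learned from data.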
Ah, OK. Interesting, thanks. Would you agree with the following view:
“The NTK/GP stuff has neural nets implementing a “pseudosimplicity prior” which is maybe also a simplicity prior but might not be; the evidence is unclear. A pseudosimplicity prior is like a simplicity prior except that there are some important classes of Kolmogorov-simple functions that don’t get high prior / high measure.”
Which would you say is more likely: (a) the NTK/GP stuff is indeed not universally data-efficient, and thus modern neural nets aren’t either, or (b) the NTK/GP stuff is indeed not universally data-efficient, and thus modern neural nets aren’t well-characterized by the NTK/GP stuff?
Yeah, that summary sounds right.
I’d say (b) -- it seems quite unlikely to me that the NTK/GP is universally data-efficient, while neural nets might be (although that’s mostly speculation on my part). I think the lack of feature learning is a stronger argument that the NTK/GP doesn’t characterize neural nets well.