Ah, that is interesting. This definitely updates me moderately toward the “NTKs don’t learn features” hypothesis.
BTW, does this hypothesis also mean that feature learning should break down in ordinary nets as they scale up? Or does increasing the data alongside the parameter count counteract that?
I think nets are usually increased in depth as well as width when they are ‘scaled up’, so the NTK limit doesn’t apply: convergence to the NTK is controlled by the ratio of depth to width, and the kernel only becomes deterministic as this ratio approaches 0.
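To make that concentration claim concrete, here's a minimal sketch (my own illustration, not from any paper's code; the `empirical_ntk_entry` helper, model sizes, and widths are just things I picked for the demo). It measures how much a single empirical NTK entry at initialization fluctuates across random seeds for MLPs of fixed depth and growing width; if convergence to a deterministic kernel is governed by depth/width, the relative seed-to-seed spread should shrink as width grows with depth held fixed.

```python
import torch

def empirical_ntk_entry(model, x1, x2):
    """Theta(x1, x2) = grad_theta f(x1) . grad_theta f(x2) for a scalar-output model."""
    params = [p for p in model.parameters() if p.requires_grad]
    g1 = torch.autograd.grad(model(x1).squeeze(), params)
    g2 = torch.autograd.grad(model(x2).squeeze(), params)
    return sum((a * b).sum() for a, b in zip(g1, g2)).item()

def mlp(width, depth, d_in):
    # `depth` hidden layers of the given width, scalar output.
    layers, d = [], d_in
    for _ in range(depth):
        layers += [torch.nn.Linear(d, width), torch.nn.ReLU()]
        d = width
    layers += [torch.nn.Linear(d, 1)]
    return torch.nn.Sequential(*layers)

torch.manual_seed(0)
d_in, depth, n_seeds = 8, 4, 20
x1, x2 = torch.randn(1, d_in), torch.randn(1, d_in)

for width in [16, 64, 256, 1024]:
    vals = []
    for seed in range(n_seeds):
        torch.manual_seed(seed)
        vals.append(empirical_ntk_entry(mlp(width, depth, d_in), x1, x2))
    vals = torch.tensor(vals)
    # Relative spread across seeds should fall as width grows at fixed depth.
    print(f"width={width:5d}  mean={vals.mean():9.3f}  rel. std={vals.std() / vals.mean().abs():.3f}")
```

Nothing here trains the net; it only probes how tightly the kernel at init concentrates across initializations, which is the quantity the depth-to-width ratio is supposed to control.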