There is an extensive discussion of feature learning in relation to the aforementioned Mingard et al. result in the comments of this post. The conclusion of the discussion was that feature learning is uncoupled from inductive bias for infinite-width (and, under further conditions, finite-width) neural networks when they are trained by a random-sampling process (essentially how NNGPs work).
The open question is whether the probability distribution over functions after each layer is the same whether you train with SGD or with random sampling. Given how close the posteriors of optimiser-trained NNs are to those of NNGPs, I think it is sensible to assume that they are similar. However, the important question is still whether this scales to large architectures and datasets, which is computationally much harder to test (the NNGP kernel becomes harder and harder to compute as the dataset grows).
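To make the scaling problem concrete, here is a minimal sketch of an NNGP computation. It assumes a fully connected ReLU network, for which the kernel has a closed-form layer-by-layer recursion (the arc-cosine kernel). The function names, the `depth`, `sigma_w2` and `sigma_b2` parameters, and the `noise` jitter are my own illustrative choices and are not taken from the Mingard et al. experiments.

```python
import numpy as np

def nngp_relu_kernel(X1, X2, depth=3, sigma_w2=2.0, sigma_b2=0.0):
    """NNGP kernel of a fully connected ReLU net (illustrative sketch).

    X1: (n1, d) array, X2: (n2, d) array. Assumes no all-zero inputs
    when sigma_b2 == 0, otherwise the normalisation divides by zero.
    """
    d = X1.shape[1]
    # Kernel of the first pre-activations.
    K12 = sigma_b2 + sigma_w2 * (X1 @ X2.T) / d
    K11 = sigma_b2 + sigma_w2 * np.sum(X1 * X1, axis=1) / d
    K22 = sigma_b2 + sigma_w2 * np.sum(X2 * X2, axis=1) / d
    for _ in range(depth):
        norm = np.sqrt(np.outer(K11, K22))
        cos_t = np.clip(K12 / norm, -1.0, 1.0)
        theta = np.arccos(cos_t)
        # Arc-cosine recursion for one ReLU layer.
        K12 = sigma_b2 + sigma_w2 / (2 * np.pi) * norm * (
            np.sin(theta) + (np.pi - theta) * cos_t
        )
        # Diagonal terms are the theta = 0 case of the same recursion.
        K11 = sigma_b2 + sigma_w2 / 2 * K11
        K22 = sigma_b2 + sigma_w2 / 2 * K22
    return K12

def nngp_posterior_mean(X_train, y_train, X_test, noise=1e-4, **kernel_args):
    """GP regression posterior mean under the NNGP kernel above."""
    K = nngp_relu_kernel(X_train, X_train, **kernel_args)        # (n, n)
    K_star = nngp_relu_kernel(X_test, X_train, **kernel_args)    # (m, n)
    # The Cholesky factorisation of the n x n matrix is the O(n^3) step.
    L = np.linalg.cholesky(K + noise * np.eye(len(X_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    return K_star @ alpha
```

For n training points the kernel matrix alone takes O(n^2) memory and the exact solve takes O(n^3) time, which is why this kind of comparison gets expensive on large datasets. For architectures without a closed-form recursion the kernel itself typically has to be estimated, which adds further cost.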
Chris Mingard comments on NTK/GP Models of Neural Nets Can’t Learn Features