Haha some things are pretty obvious—it’s always really nice to get a very different perspective on an idea, thank you for continuing the conversation!
I see, so you’re saying that even though the distribution of output functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of intermediate functions might change. If true, this would mean that feature learning and inductive bias are ‘uncoupled’ for infinite randomly-sampled nets.
That is exactly what I’m saying. I don’t know if it is testable in practice, but it is in theory … I would be very interested to see anything about this—let me know if you find anything!
If it turns out that, in the limit of infinite width, feature learning does not work, what are your thoughts about my case for feature learning for the narrow (but trained-by-random-sampling) case? I would guess you find this case significantly more compelling than the infinite width case?
I just came across this paper which derives an expression for the posterior distribution of the weights in each layer in the infinite-width limit. The result: the distribution is unchanged from the prior in every layer but the last. So it indeed seems that there is no feature learning in this limit.
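For the narrow case, something like the following could be a starting point for checking this numerically. It is only a rough sketch under my own assumptions, not the paper's method: a one-hidden-layer ReLU net with no biases, a two-point toy dataset, "training by random sampling" implemented as plain rejection sampling from a Gaussian prior, and kernel-label alignment of the hidden features as a crude proxy for feature learning. The function names (`sample_feature_kernels`, `label_alignment`, `compare_prior_posterior`) are made up for illustration.

```python
import numpy as np

def sample_feature_kernels(width, X, y, n_samples=5000, seed=0):
    """Sample nets from the prior; flag those that classify (X, y) correctly.

    Returns the hidden-feature Gram matrix of every sampled net and a boolean
    mask marking the accepted ("posterior") samples.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Assumed prior: i.i.d. Gaussian weights with 1/sqrt(fan_in) scaling.
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n_samples, d, width))
    w2 = rng.normal(0.0, 1.0 / np.sqrt(width), size=(n_samples, width))
    # Hidden features, shape (samples, points, width).
    H = np.maximum(np.einsum("nd,sdw->snw", X, W1), 0.0)
    out = np.einsum("snw,sw->sn", H, w2)
    # Rejection step: keep nets whose output signs match every label.
    accepted = np.all(np.sign(out) == y, axis=1)
    # Per-net feature kernel K = H H^T / width, shape (samples, points, points).
    K = np.einsum("snw,smw->snm", H, H) / width
    return K, accepted

def label_alignment(K, y):
    """Alignment y^T K y / (||K||_F * ||y||^2) for each sampled kernel."""
    num = np.einsum("n,snm,m->s", y, K, y)
    den = np.linalg.norm(K, axis=(1, 2)) * (y @ y)
    return num / np.maximum(den, 1e-12)

def compare_prior_posterior(widths=(2, 64)):
    # Toy dataset (an assumption): two orthogonal inputs with opposite labels.
    X = np.eye(2)
    y = np.array([1.0, -1.0])
    for width in widths:
        K, accepted = sample_feature_kernels(width, X, y)
        a = label_alignment(K, y)
        print(f"width={width}: accepted={accepted.sum()}, "
              f"prior alignment={a.mean():.3f}, "
              f"posterior alignment={a[accepted].mean():.3f}")

if __name__ == "__main__":
    compare_prior_posterior()
```

If the hidden features really are unchanged from the prior at large width, the prior and posterior alignments should approach each other as the width grows, while a persistent gap at small width would be consistent with feature learning in the narrow case. A two-point dataset is obviously far too small to settle anything, so treat this as a template rather than evidence.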