This approximation isn’t obvious to me. It holds if $ f(x, \theta_0) \approx 0 $ and $ \theta_0 \approx 0 $, but these aren’t stated. Are they true?
Yeah good point, I should have put more detail here.
My understanding is that, for most common initialization distributions and architectures, $f(x, \theta_0) = 0$ and $\phi(x) \cdot \theta_0 = 0$ in the infinite-width limit. This is because they both end up being expectations of random variables that are symmetrically distributed around 0.
However, if we want to be precise in the finite-width regime, we can simply add those terms back onto the kernel regression.
So really, with finite width:
$$f_{\text{linear}}(x, \theta) = f(x, \theta_0) + K(x, X) K^{-1}(X, X) Y - \nabla_\theta f(x, \theta_0) \cdot \theta_0$$

There are a few other very non-rigorous parts of our explanation. Another big one is that $\phi(x) \cdot \theta$ is underspecified by the data in the infinite-width limit, so it could fit the data in lots of ways. Bringing in ridge-regularized regression and the details of gradient descent fixes this, I believe, but I'm not totally sure whether it changes anything at finite width.
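To make the finite-width formula concrete, here's a small numerical sketch (mine, not from the post; the tiny tanh net, its sizes, and the random data are just placeholders). It builds the features $\phi(x) = \nabla_\theta f(x, \theta_0)$ with JAX, takes the minimum-norm $\theta$ solving the kernel regression $\phi(X) \cdot \theta = Y$, and checks that plugging it into the linearized model matches the kernel-regression prediction plus the two correction terms above.

```python
import jax
import jax.numpy as jnp

D_IN, WIDTH, N_TRAIN = 3, 32, 5          # made-up sizes for the sketch
N_PARAMS = D_IN * WIDTH + WIDTH

def f(theta, x):
    """Tiny two-layer tanh network with a flat parameter vector theta."""
    W1 = theta[: D_IN * WIDTH].reshape(D_IN, WIDTH)
    w2 = theta[D_IN * WIDTH:]
    return jnp.tanh(x @ W1) @ w2 / jnp.sqrt(WIDTH)

theta0 = jax.random.normal(jax.random.PRNGKey(0), (N_PARAMS,))   # random init
X = jax.random.normal(jax.random.PRNGKey(1), (N_TRAIN, D_IN))    # train inputs
Y = jax.random.normal(jax.random.PRNGKey(2), (N_TRAIN,))         # train targets
x_test = jax.random.normal(jax.random.PRNGKey(3), (D_IN,))       # test input

# Feature map of the linearization: phi(x) = grad_theta f(x, theta_0).
phi = lambda x: jax.grad(f)(theta0, x)
Phi_X = jax.vmap(phi)(X)          # (N_TRAIN, N_PARAMS)
phi_x = phi(x_test)               # (N_PARAMS,)

K_XX = Phi_X @ Phi_X.T            # empirical NTK on the training set, K(X, X)
K_xX = Phi_X @ phi_x              # K(x, X), shape (N_TRAIN,)

# Minimum-norm theta solving phi(X) @ theta = Y -- the kernel-regression fit.
theta_star = Phi_X.T @ jnp.linalg.solve(K_XX, Y)

# Linearized model at theta_star vs. kernel regression plus the correction terms.
f_lin = f(theta0, x_test) + phi_x @ (theta_star - theta0)
rhs = f(theta0, x_test) + K_xX @ jnp.linalg.solve(K_XX, Y) - phi_x @ theta0

print(f_lin, rhs)                 # should agree up to numerical error
```

Using the minimum-norm solution here is one specific way of resolving the underspecification I mentioned; I believe it's also what ridge regression on $\phi(x) \cdot \theta$ recovers as the regularization goes to zero.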
Thanks!