This approximation isn’t obvious to me. It holds if $ f(x, \theta_0) \approx 0 $ and $ \theta_0 \approx 0 $, but these aren’t stated. Are they true?
Yeah good point, I should have put more detail here.
My understanding is that, for most common initialization distributions and architectures, $f(x, \theta_0) = 0$ and $\phi(x) \cdot \theta_0 = 0$ in the infinite-width limit. This is because they both end up being expectations of random variables that are symmetrically distributed around 0.
However, if we want to be precise in the finite-width regime, we can simply add those terms back onto the kernel regression.
So really, with finite width:
$$f_{\text{linear}}(x, \theta) = f(x, \theta_0) + K(x, X) K^{-1}(X, X) Y - \nabla_\theta f(x, \theta_0) \cdot \theta_0$$

There are a few other very non-rigorous parts of our explanation. Another big one is that $\phi(x) \cdot \theta$ is underspecified by the data in the infinite-width limit, so it could fit the data in lots of ways. Bringing in ridge-regularized regression and the details of gradient descent fixes this, I believe, but I'm not totally sure whether it changes anything at finite width.
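To make the finite-width formula concrete, here's a small numerical sketch (mine, not from the post; the tiny tanh net, its sizes, and the random data are just placeholders). It builds the features $\phi(x) = \nabla_\theta f(x, \theta_0)$ with JAX, takes the minimum-norm $\theta$ solving the kernel regression $\phi(X) \cdot \theta = Y$, and checks that plugging it into the linearized model matches the kernel-regression prediction plus the two correction terms above.

```python
import jax
import jax.numpy as jnp

D_IN, WIDTH, N_TRAIN = 3, 32, 5          # made-up sizes for the sketch
N_PARAMS = D_IN * WIDTH + WIDTH

def f(theta, x):
    """Tiny two-layer tanh network with a flat parameter vector theta."""
    W1 = theta[: D_IN * WIDTH].reshape(D_IN, WIDTH)
    w2 = theta[D_IN * WIDTH:]
    return jnp.tanh(x @ W1) @ w2 / jnp.sqrt(WIDTH)

theta0 = jax.random.normal(jax.random.PRNGKey(0), (N_PARAMS,))   # random init
X = jax.random.normal(jax.random.PRNGKey(1), (N_TRAIN, D_IN))    # train inputs
Y = jax.random.normal(jax.random.PRNGKey(2), (N_TRAIN,))         # train targets
x_test = jax.random.normal(jax.random.PRNGKey(3), (D_IN,))       # test input

# Feature map of the linearization: phi(x) = grad_theta f(x, theta_0).
phi = lambda x: jax.grad(f)(theta0, x)
Phi_X = jax.vmap(phi)(X)          # (N_TRAIN, N_PARAMS)
phi_x = phi(x_test)               # (N_PARAMS,)

K_XX = Phi_X @ Phi_X.T            # empirical NTK on the training set, K(X, X)
K_xX = Phi_X @ phi_x              # K(x, X), shape (N_TRAIN,)

# Minimum-norm theta solving phi(X) @ theta = Y -- the kernel-regression fit.
theta_star = Phi_X.T @ jnp.linalg.solve(K_XX, Y)

# Linearized model at theta_star vs. kernel regression plus the correction terms.
f_lin = f(theta0, x_test) + phi_x @ (theta_star - theta0)
rhs = f(theta0, x_test) + K_xX @ jnp.linalg.solve(K_XX, Y) - phi_x @ theta0

print(f_lin, rhs)                 # should agree up to numerical error
```

Using the minimum-norm solution here is one specific way of resolving the underspecification I mentioned; I believe it's also what ridge regression on $\phi(x) \cdot \theta$ recovers as the regularization goes to zero.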
Thanks!