Typically the property which induces sinusoidal eigenfunctions is some kind of permutation invariance—e.g. if you can rotate the system without changing the loss function, that should induce sinusoids.
The underlying reason for this:
When two (diagonalizable) matrices commute, they share an eigenbasis. In this case, the “commutation” is between the matrix whose eigenvectors we want and the permutation.
The eigendecomposition of a permutation matrix is, roughly speaking, a Fourier transform, so its eigenvectors are sinusoids.
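As a quick illustration of that last point (a minimal numpy sketch; the size n = 8 is an arbitrary choice): the columns of the discrete Fourier transform matrix are eigenvectors of a cyclic shift permutation, so that permutation's eigenvectors are complex sinusoids.

```python
import numpy as np

n = 8
# Cyclic shift permutation: (P y)[i] = y[(i - 1) % n], i.e. rotate the vector by one slot.
P = np.roll(np.eye(n), 1, axis=0)

# Discrete Fourier basis: F[j, k] = exp(2*pi*i*j*k/n) / sqrt(n).
j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
F = np.exp(2j * np.pi * j * k / n) / np.sqrt(n)

# Each Fourier column is an eigenvector of P with eigenvalue exp(-2*pi*i*k/n),
# an n-th root of unity; its real and imaginary parts are sampled sinusoids.
for freq in range(n):
    lhs = P @ F[:, freq]
    rhs = np.exp(-2j * np.pi * freq / n) * F[:, freq]
    assert np.allclose(lhs, rhs)
print("all DFT columns are eigenvectors of the cyclic shift")
```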
We don’t fully understand this comment.
Our current understanding is this:
The kernel matrix K of shape n×n takes in a label vector y and outputs a real number: y⊤Ky. That real number is roughly the negative log prior probability of that label set.
We can make some orthogonal matrix R that transforms the labels y such that the real-number output doesn’t change: (Ry)⊤K(Ry)=y⊤Ky.
This is a transformation that keeps the label prior probability the same, for any label vector.
(Ry)⊤K(Ry)=y⊤Ky for all y∈ℝⁿ iff RK=KR (invariance for all y means R⊤KR=K, and since R⊤=R⁻¹ for an orthogonal R this is the same as KR=RK), which implies R and K share the same eigenvectors (under the additional assumption that K’s eigenvalues are all distinct, which we think should be true in this case).
Therefore we can just find the eigenvectors of R.
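Here is a minimal numerical check of that chain of reasoning (the particular K below, a squared-exponential kernel on points arranged in a ring, is an arbitrary choice for illustration, not the kernel from the post): a shift-invariant K commutes with the cyclic shift R, the quadratic form y⊤Ky is invariant under R, and K’s eigenvectors are discrete sinusoids.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32

# A kernel matrix on n points placed on a ring: K[i, j] depends only on the
# circular distance between i and j, so shifting every point by one slot
# leaves K unchanged.
idx = np.arange(n)
diff = np.abs(idx[:, None] - idx[None, :])
d = np.minimum(diff, n - diff)
K = np.exp(-(d / 4.0) ** 2)

# R = cyclic shift permutation, an orthogonal matrix.
R = np.roll(np.eye(n), 1, axis=0)

# 1) Commutation: R K = K R.
assert np.allclose(R @ K, K @ R)

# 2) Quadratic-form invariance: (Ry)^T K (Ry) = y^T K y for any y.
y = rng.standard_normal(n)
assert np.isclose((R @ y) @ K @ (R @ y), y @ K @ y)

# 3) Shared eigenvectors: K's leading eigenvectors are sampled sinusoids.
#    Check the one paired with the second-largest eigenvalue: its DFT power
#    is concentrated in a single frequency pair.
eigvals, eigvecs = np.linalg.eigh(K)
v = eigvecs[:, -2]
power = np.abs(np.fft.fft(v)) ** 2
print("fraction of power in dominant frequency pair:",
      np.sort(power)[-2:].sum() / power.sum())
```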
But what can R be? If K had some repeated eigenvalues, then we could construct an R that works for all y (a small numerical example of this case is sketched below). But empirically, aren’t the eigenvalues of K all different? So we are confused about that.
Also we are confused about this: “without changing the loss function”. We aren’t sure how the loss function comes into it.
Also this: “training against a sinusoid” seems false? Or we really don’t know what this means.
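Here is the repeated-eigenvalue example we mean (the spectrum and rotation angle below are arbitrary, chosen only for illustration): if two eigenvalues of K coincide, any rotation inside that shared eigenspace gives an orthogonal R with RK=KR, and hence (Ry)⊤K(Ry)=y⊤Ky for every y.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# Symmetric K with a repeated eigenvalue: spectrum (3, 2, 2, 1, 0.5).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))    # random orthonormal eigenvectors
K = Q @ np.diag([3.0, 2.0, 2.0, 1.0, 0.5]) @ Q.T

# Rotate by an arbitrary angle inside the 2-d eigenspace of the eigenvalue 2.
theta = 0.7
G = np.eye(n)
G[1:3, 1:3] = [[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]]
R = Q @ G @ Q.T                                     # orthogonal, not just a sign flip

# R commutes with K, so the quadratic form is invariant for every y.
assert np.allclose(R @ K, K @ R)
y = rng.standard_normal(n)
assert np.isclose((R @ y) @ K @ (R @ y), y @ K @ y)
```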
Ignore the part about training against a sinusoid. That was a more specific hypothesis, the symmetry thing is more general. Also ignore the part about “not changing the loss function”, since you’ve got the right math.
I’m a bit confused that you’re calling y a label vector; shouldn’t it be shaped like a data point? E.g. if I’m training an image classifier, that vector should be image-shaped. And then the typical symmetry we’d expect is that the kernel is (approximately) invariant to shifting the image left, right, up, or down by a pixel, and we could take any of those shifts to be R.
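A minimal sketch of the shift operator I mean, assuming for concreteness a small image with wrap-around at the border so that the shift is an exact permutation (the image size is arbitrary):

```python
import numpy as np

def right_shift_operator(h, w):
    """Permutation R on flattened h-by-w images: (R x) is x shifted one pixel
    to the right, with wrap-around so that R is exactly orthogonal."""
    n = h * w
    R = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            src = r * w + c               # pixel (r, c) ...
            dst = r * w + (c + 1) % w     # ... lands at (r, (c + 1) mod w)
            R[dst, src] = 1.0
    return R

R = right_shift_operator(4, 5)
assert np.allclose(R.T @ R, np.eye(20))   # permutation matrices are orthogonal

# Any kernel with k(R x, R x') = k(x, x') for all images x, x' is invariant
# to this shift, which is the (approximate) symmetry described above.
```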
The eigenfunctions we are calculating are solutions to:
λϕ(x′) = ∫_{x∼D} k(x′,x) ϕ(x) dx
where D is the data distribution, λ is an eigenvalue, and ϕ(x) is an eigenfunction.
So the eigenfunction is a label function with input x, a datapoint. The discrete approximation to it is a label vector, which I called y above.
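A minimal numerical sketch of this discretization (the RBF kernel and the uniform data distribution below are arbitrary stand-ins, just to make it runnable): replace the integral over D with an average over n samples, so the eigenfunction equation becomes an ordinary eigenvector problem for the kernel matrix, and each eigenvector is ϕ evaluated at the samples, i.e. the label vector y.

```python
import numpy as np

rng = np.random.default_rng(0)

def k(x1, x2, ell=0.25):
    """Illustrative RBF kernel on scalar inputs (a stand-in, not the kernel from the post)."""
    return np.exp(-((x1 - x2) ** 2) / (2 * ell ** 2))

# Sample n points from the data distribution D (here: uniform on [0, 1]).
n = 300
x = np.sort(rng.uniform(0.0, 1.0, size=n))

# Discretize  lambda * phi(x') = ∫_{x~D} k(x', x) phi(x) dx  by replacing the
# integral over D with a sample average: (1/n) K v = lambda v.
K = k(x[:, None], x[None, :])
eigvals, eigvecs = np.linalg.eigh(K / n)

# Columns of eigvecs approximate the eigenfunctions phi evaluated at the samples;
# each column is the discrete "label vector" y from the discussion above.
print("leading eigenvalues:", eigvals[-4:][::-1])
```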