Presumably the eigenfunctions are mostly sinusoidal because you’re training against a sinusoid? So it’s not really relevant that “it’s really hard for us to express abstract concepts like ‘is this network deceptive?’ in the language of the kernel eigenfunctions sine wave decomposition”; presumably the eigenfunctions will be quite different for more realistic problems.
Hmm, the eigenfunctions just depend on the input training data distribution (which we call X), and in this experiment the inputs are spread evenly over the interval [−π,π). Since the decomposition doesn’t involve the labels at all, you’ll get the same NTK eigendecomposition regardless of the target function.
I’ll probably spin up some quick experiments with a multi-dimensional input space to see if it looks different, but I would be quite surprised if the eigenfunctions stopped being sinusoidal. Another thing to vary could be the distribution of the input points.
Typically the property which induces sinusoidal eigenfunctions is some kind of permutation invariance—e.g. if you can rotate the system without changing the loss function, that should induce sinusoids.
The underlying reason for this:
When two matrices commute (and are diagonalizable), they can be simultaneously diagonalized, i.e. they share a common set of eigenvectors. In this case, the “commutation” is between the matrix whose eigenvectors we want and the permutation.
The eigendecomposition of a (cyclic) permutation matrix is, roughly speaking, a Fourier transform, so its eigenvectors are sinusoids.
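A quick numpy sketch of that last point (my own toy check, not something from the post): the columns of the DFT matrix are eigenvectors of a cyclic-shift permutation, and their real and imaginary parts are sampled cosines and sines.

```python
import numpy as np

n = 8
# Cyclic shift permutation: (P v)[i] = v[(i + 1) % n]
P = np.roll(np.eye(n), 1, axis=1)

# DFT basis: F[j, k] = exp(2*pi*i*j*k / n) / sqrt(n)
idx = np.arange(n)
F = np.exp(2j * np.pi * np.outer(idx, idx) / n) / np.sqrt(n)

# Each column of F is an eigenvector of P with eigenvalue exp(2*pi*i*k / n);
# the real/imaginary parts of the columns are sampled cosine/sine waves.
roots = np.exp(2j * np.pi * idx / n)
assert np.allclose(P @ F, F * roots)
```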
We don’t fully understand this comment.
Our current understanding is this:
The kernel matrix K of shape n×n takes in two label vectors and outputs a real number: y⊤Ky. The real number is roughly the negative log prior probability of that label set (strictly, the negative log prior under a Gaussian prior with covariance K involves K⁻¹ rather than K, but K and K⁻¹ share eigenvectors, so this doesn’t change the argument).
We can look for an orthogonal matrix R that transforms the labels y such that the real-number output doesn’t change: (Ry)⊤K(Ry)=y⊤Ky.
This is a transformation that keeps the label prior probability the same, for any label vector.
(Ry)⊤K(Ry)=y⊤Ky for all y∈ℝⁿ iff RK=KR, which implies R and K share the same eigenvectors (under the additional assumption that K’s eigenvalues are all distinct, which we think should hold in this case).
Therefore we can just find the eigenvectors of R.
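(A minimal numerical sketch of this chain of reasoning, using an assumed toy stand-in rather than the actual NTK: K built from a translation-invariant kernel on evenly spaced points on [−π,π), so that K is circulant and the cyclic shift of the points is one valid choice of R.)

```python
import numpy as np

n = 16
x = np.linspace(-np.pi, np.pi, n, endpoint=False)   # evenly spaced inputs

# Toy stand-in for K: depends only on the (periodic) distance between inputs,
# so K is circulant. The actual NTK matrix is only approximately like this.
K = np.exp(np.cos(x[:, None] - x[None, :]))

# R: cyclic shift of the sample points (an orthogonal permutation matrix).
R = np.roll(np.eye(n), 1, axis=1)

# R and K commute, and the quadratic form is unchanged for any label vector y.
assert np.allclose(R @ K, K @ R)
y = np.random.randn(n)
assert np.allclose((R @ y) @ K @ (R @ y), y @ K @ y)

# Commutation means R maps each eigenspace of K into itself.
w, V = np.linalg.eigh(K)
for wi, vi in zip(w, V.T):
    assert np.allclose(K @ (R @ vi), wi * (R @ vi))

# Caveat: for this toy K the nonzero-frequency eigenvalues come in equal
# cos/sin pairs, so the real eigenvectors within a pair are only determined
# up to a rotation, and are not individually eigenvectors of R.
```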
But what can R be? If K had some repeated eigenvalues, then we could construct an R that works for all y. But empirically, aren’t the eigenvalues of K all distinct?
So we are confused about that.
Also we are confused about this: “without changing the loss function”. We aren’t sure how the loss function comes into it.
Also this: “training against a sinusoid” seems false? Or we really don’t know what this means.
Ignore the part about training against a sinusoid. That was a more specific hypothesis, the symmetry thing is more general. Also ignore the part about “not changing the loss function”, since you’ve got the right math.
I’m a bit confused that you’re calling y a label vector; shouldn’t it be shaped like a data pt? E.g. if I’m training an image classifier, that vector should be image-shaped. And then the typical symmetry we’d expect is that the kernel is (approximately) invariant to shifting the image left, right, up or down a pixel, and we could take any of those shifts to be R.
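To make that concrete, a toy check (using an RBF kernel on raw pixels as a stand-in for the real kernel, and circular shifts so the invariance is exact rather than approximate):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 28, 28))   # two toy "images"

def rbf(u, v, ell=10.0):
    # Kernel that depends only on the pixel-wise distance between its inputs.
    return np.exp(-0.5 * np.sum((u - v) ** 2) / ell ** 2)

# Shifting BOTH images by one pixel (circularly) preserves their distance,
# so the kernel value is unchanged; that shift is a candidate R.
a_s, b_s = np.roll(a, 1, axis=1), np.roll(b, 1, axis=1)
assert np.isclose(rbf(a, b), rbf(a_s, b_s))
```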
The eigenfunctions we are calculating are solutions to:
λϕ(x′) = E_{x∼D}[k(x′,x)ϕ(x)]
Where D is the data distribution, λ is an eigenvalue and ϕ(x) is an eigenfunction.
So the eigenfunction is a label function with input x, a datapoint. The discrete approximation to it is a label vector, which I called y above.
I’d expect that as long as the prior favors smoother functions, the eigenfunctions would tend to look sinusoidal?
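For concreteness, here is roughly how I’d discretize that equation (a sketch only; the RBF kernel below stands in for the NTK, since the qualitative point just needs a smooth kernel and uniform data on [−π,π)):

```python
import numpy as np
import matplotlib.pyplot as plt

# Discrete (Nystrom-style) approximation to
#   lambda * phi(x') = E_{x~D}[ k(x', x) * phi(x) ]
# Sample x ~ D, then eigenvectors of K / n approximate phi at the sample points.

n = 256
x = np.sort(np.random.uniform(-np.pi, np.pi, n))   # D = uniform on [-pi, pi)

# Smooth stationary stand-in kernel (NOT the NTK; just for illustration).
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.5 ** 2)

w, V = np.linalg.eigh(K / n)
top = np.argsort(w)[::-1][:4]

# The leading approximate eigenfunctions look like low-frequency waves
# (sinusoid-ish, with some distortion near the interval boundaries).
for i in top:
    plt.plot(x, np.sqrt(n) * V[:, i], label=f"eigenvalue {w[i]:.3f}")  # sqrt(n): unit L2(D) norm
plt.xlabel("x"); plt.ylabel("approx. eigenfunction"); plt.legend(); plt.show()
```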