The kernel matrix K of shape n×n takes in two label vectors and outputs a real number: y⊤Ky. This number is roughly the negative log prior probability of that label set.
We can make some orthogonal matrix R that transforms the labels y such that this output doesn’t change: (Ry)⊤K(Ry) = y⊤Ky.
This is a transformation that keeps the label prior probability the same, for any label vector.
(Ry)⊤K(Ry) = y⊤Ky holds for all y ∈ ℝⁿ iff RK = KR, which implies that R and K share the same eigenvectors (given the additional assumption that K’s eigenvalues are all distinct, which we think should be true in this case).
Therefore we can just find the eigenvectors of R.
But what can R be? If K had some repeated eigenvalues, then we could construct such an R that works for all y (e.g. by rotating within a repeated eigenspace). But empirically, aren’t the eigenvalues of K all distinct? So we are confused about that.
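To make the point above concrete, here is a minimal NumPy sketch (the random K is just a stand-in for the real kernel matrix): with all-distinct eigenvalues, an orthogonal R commuting with K amounts to sign flips in K’s eigenbasis, while a repeated eigenvalue would let us rotate freely inside the degenerate eigenspace.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Stand-in symmetric PSD kernel matrix (generically has distinct eigenvalues).
A = rng.normal(size=(n, n))
K = A @ A.T
evals, V = np.linalg.eigh(K)          # K = V diag(evals) V^T

# With distinct eigenvalues, an orthogonal R commuting with K must be
# diagonal in K's eigenbasis, i.e. just sign flips of K's eigenvectors.
R = V @ np.diag([1, -1, 1, 1, -1, 1]) @ V.T

y = rng.normal(size=n)
print(np.allclose(R @ K, K @ R))                       # True: R and K commute
print(np.allclose((R @ y) @ K @ (R @ y), y @ K @ y))   # True: y^T K y unchanged

# If K instead had a repeated eigenvalue, we could rotate inside the
# degenerate eigenspace and get a genuinely different commuting R.
evals2 = evals.copy()
evals2[1] = evals2[0]                                  # force a repeated eigenvalue
K2 = V @ np.diag(evals2) @ V.T
theta = 0.3
rot = np.eye(n)
rot[:2, :2] = [[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]]
R2 = V @ rot @ V.T
print(np.allclose(R2 @ K2, K2 @ R2))                      # True: still commute
print(np.allclose((R2 @ y) @ K2 @ (R2 @ y), y @ K2 @ y))  # True: still invariant
```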
We are also confused about this: “without changing the loss function”. We aren’t sure how the loss function comes into it.
And this: “training against a sinusoid” seems false to us? Or we really don’t know what it means.
Ignore the part about training against a sinusoid. That was a more specific hypothesis; the symmetry thing is more general. Also ignore the part about “not changing the loss function”, since you’ve got the right math.
I’m a bit confused that you’re calling y a label vector; shouldn’t it be shaped like a data point? E.g. if I’m training an image classifier, that vector should be image-shaped. And then the typical symmetry we’d expect is that the kernel is (approximately) invariant to shifting the image left, right, up, or down by a pixel, and we could take any of those shifts to be R.
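For instance (a hypothetical 1-D periodic “image” with a made-up stationary kernel, since the real kernel isn’t written down here): the shift-by-one-pixel permutation is orthogonal, commutes with the kernel matrix, and the shared eigenvectors are discrete Fourier modes.

```python
import numpy as np

n = 32
grid = np.arange(n)

# A stationary kernel on a periodic 1-D "image": k(x, x') depends only on
# the circular distance between pixels x and x' (illustrative choice).
dist = np.minimum(np.abs(grid[:, None] - grid[None, :]),
                  n - np.abs(grid[:, None] - grid[None, :]))
K = np.exp(-0.5 * (dist / 3.0) ** 2)

# R = shift-by-one-pixel permutation matrix (orthogonal).
R = np.roll(np.eye(n), 1, axis=0)

# The shift symmetry: R commutes with K, so y^T K y is shift-invariant.
print(np.allclose(R @ K, K @ R))                       # True

y = np.random.default_rng(0).normal(size=n)
print(np.allclose((R @ y) @ K @ (R @ y), y @ K @ y))   # True

# Shared eigenvectors: a discrete Fourier mode is an eigenvector of both
# the shift R and the circulant kernel matrix K.
f = np.exp(2j * np.pi * 3 * grid / n)                  # frequency-3 mode
print(np.allclose(K @ f, np.fft.fft(K[0])[3] * f))     # eigenvalue = DFT of K's first row
print(np.allclose(R @ f, np.exp(-2j * np.pi * 3 / n) * f))
```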
We don’t fully understand this comment.
Our current understanding is this:
The eigenfunctions we are calculating are solutions to:
λϕ(x′) = ∫_{x∼D} k(x′, x) ϕ(x) dx
Where D is the data distribution, λ is an eigenvalue and ϕ(x) is an eigenfunction.
So the eigenfunction is a label function with input x, a datapoint. The discrete approximation to it is a label vector, which I called y above.
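Concretely, a Nyström-style discretisation of that integral equation looks like the sketch below (the RBF kernel and Gaussian data distribution are placeholders for the real k and D, and we read ∫_{x∼D}(·)dx as an expectation over D):

```python
import numpy as np

rng = np.random.default_rng(0)

def k(x1, x2):
    # Illustrative stand-in kernel (RBF); the real k would come from the model.
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2)

# Sample datapoints x_i from the data distribution D (here: standard normal).
n = 500
xs = rng.normal(size=n)

# Monte-Carlo discretisation of  lambda * phi(x') = E_{x~D}[k(x', x) phi(x)]:
#   lambda * phi(x_i)  ~=  (1/n) * sum_j k(x_i, x_j) phi(x_j)
# so eigenpairs of K/n approximate (lambda, phi evaluated at the samples).
K = k(xs, xs)
evals, evecs = np.linalg.eigh(K / n)

# The top eigenvector is the discretised top eigenfunction, up to scaling:
# one entry per datapoint -- exactly the "label vector" y referred to above.
lam, phi_hat = evals[-1], evecs[:, -1]
print(lam, phi_hat.shape)
```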