Thanks for the great post! I have a question, if it’s not too much trouble:
Sorry for my confusion about something so silly, but shouldn’t the following be “when α⩽2”?
When α≥2 there is no place where the derivative vanishes
I’m also a bit confused about why we can think of α as representing “which moment of the interference distribution we care about.”
Perhaps some of my confusion here stems from the fact that it seems to me that the optimal number of subspaces, k=neα/(2−α), is an increasing function of α, which doesn’t seem to line up with the following:
Hence when α is large we want to have fewer subspaces
Sorry for my confusion about something so silly, but shouldn’t the following be “when α⩽2
Oh you’re totally right. And k=1 should be k=d there. I’ll edit in a fix.
I’m also a bit confused about why we can think of α as representing “which moment of the interference distribution we care about.”
It’s not precisely which moment, but as we vary α the moment(s) of interest vary monotonically.
Perhaps some of my confusion here stems from the fact that it seems to me that the optimal number of subspaces, k=neα/(2−α), is an increasing function of α, which doesn’t seem to line up with the following:
This comment turned into a fascinating rabbit hole for me, so thank you!
It turns out that there is another term in the Johnson-Lindenstrauss expression that’s important. Specifically, the relation between ϵ, m, and D should be ϵ2/2−ϵ3/3≥4logm/D (per Scikit and references therein). The numerical constants aren’t important, but the cubic term is, because it means the interference grows rather faster as m grows (especially in the vicinity of ϵ≈1).
With this correction it’s no longer feasible to do things analytically, but we can still do things numerically. The plots below are made with n=105,d=104:
The top panel shows the normalized loss for a few different α≤2, and the lower shows the loss derivative with respect to k. Note that the range of k is set by the real roots of ϵ2/2−ϵ3/3≥4logm/D: for larger k there are no real roots, which corresponds to the interference ϵ crossing unity. In practice this bound applies well before k→d. Intuitively, if there are more vectors than dimensions then the interference becomes order-unity (so there is no information left!) well before the subspace dimension falls to unity.
Anyway, all of these curves have global minima in the interior of the domain (if just barely for α=0.5), and the minima move to the left as α rises. That is, for α≤2 we care increasingly about higher moments as we increase α and so we want fewer subspaces.
What happens for α>2?
The global minima disappear! Now the optimum is always k=1. In fact though the transition is no longer at α=2 but a little higher:
So the basic story still holds, but none of the math involved in finding the optimum applies!
Thanks for the great post! I have a question, if it’s not too much trouble:
Sorry for my confusion about something so silly, but shouldn’t the following be “when α⩽2”?
I’m also a bit confused about why we can think of α as representing “which moment of the interference distribution we care about.”
Perhaps some of my confusion here stems from the fact that it seems to me that the optimal number of subspaces, k=neα/(2−α), is an increasing function of α, which doesn’t seem to line up with the following:
What am I missing here?
Oh you’re totally right. And k=1 should be k=d there. I’ll edit in a fix.
It’s not precisely which moment, but as we vary α the moment(s) of interest vary monotonically.
This comment turned into a fascinating rabbit hole for me, so thank you!
It turns out that there is another term in the Johnson-Lindenstrauss expression that’s important. Specifically, the relation between ϵ, m, and D should be ϵ2/2−ϵ3/3≥4logm/D (per Scikit and references therein). The numerical constants aren’t important, but the cubic term is, because it means the interference grows rather faster as m grows (especially in the vicinity of ϵ≈1).
With this correction it’s no longer feasible to do things analytically, but we can still do things numerically. The plots below are made with n=105,d=104:
The top panel shows the normalized loss for a few different α≤2, and the lower shows the loss derivative with respect to k. Note that the range of k is set by the real roots of ϵ2/2−ϵ3/3≥4logm/D: for larger k there are no real roots, which corresponds to the interference ϵ crossing unity. In practice this bound applies well before k→d. Intuitively, if there are more vectors than dimensions then the interference becomes order-unity (so there is no information left!) well before the subspace dimension falls to unity.
Anyway, all of these curves have global minima in the interior of the domain (if just barely for α=0.5), and the minima move to the left as α rises. That is, for α≤2 we care increasingly about higher moments as we increase α and so we want fewer subspaces.
What happens for α>2?
The global minima disappear! Now the optimum is always k=1. In fact though the transition is no longer at α=2 but a little higher:
So the basic story still holds, but none of the math involved in finding the optimum applies!
I’ll edit the post to make this clear.