Thanks for writing this! I found this a really helpful post for clarifying my own intuitions. Trying to operationalise what confused me before, and what now feels clear:
Confusion: Why does the model want to split vectors into these orthogonal subspaces? This seems somewhat unnatural and wasteful—it loses a lot of degrees of freedom, and surely it wants to spread out and minimise interference as much as possible?
Implicitly, I was imagining something like L2 loss where the model wants to minimise the sum of squared dot products.
New intuition: There is no inherently correct solution to this problem! It all depends on the precise loss function (or, the impact of each pairwise interference on the loss function). If the model has 100 dimensions and needs to fit in 1000 vectors, it can do this by packing all 1000 spread out across all 100 dimensions, or by packing 500 into the first 50 dimensions and 500 into the second 50. The second approach immediately sets 500^2 of the dot products to 0, at the cost of increasing the dot products within each partition of 500.
Intuitively, there’s going to be some kind of conservation property affecting the total amount of interference, but the model can choose to allocate that towards minimising the number of significant interferences or the maximum interference. Smearing it across all dimensions minimises the maximum, forming a partition minimises the number. So the choice depends on the model’s exact loss function.
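To make this concrete, here's a rough numerical sketch of the 1000-vectors-in-100-dimensions example, using random unit vectors as a crude stand-in for whatever packing the model would actually learn (so the exact numbers are only illustrative, not an optimal arrangement):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unit_vectors(n, d):
    """n random unit vectors in d dimensions (a stand-in for a learned packing)."""
    v = rng.normal(size=(n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def interference_stats(vecs):
    """Off-diagonal dot products: how many are (near) zero, the max, and the total squared."""
    dots = vecs @ vecs.T
    off = dots[~np.eye(len(vecs), dtype=bool)]
    return {
        "num_zero": int(np.sum(np.abs(off) < 1e-9)),
        "max_abs": float(np.max(np.abs(off))),
        "sum_sq": float(np.sum(off ** 2)),
    }

n, d = 1000, 100

# Option A: smear all 1000 vectors across the full 100 dimensions.
spread = random_unit_vectors(n, d)

# Option B: a tegum-style partition -- 500 vectors in the first 50 dims,
# 500 in the last 50, so every cross-partition dot product is exactly zero.
partitioned = np.zeros((n, d))
partitioned[:500, :50] = random_unit_vectors(500, 50)
partitioned[500:, 50:] = random_unit_vectors(500, 50)

print("spread:     ", interference_stats(spread))
print("partitioned:", interference_stats(partitioned))
# Typical result: the partition gets ~500,000 exactly-zero (ordered) pairs but a larger
# max interference, while the total squared interference is about the same either way,
# which is the conservation-ish trade-off described above.
```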
In practice, the model’s loss function will be really complicated—for any pair of features, cost of interference goes up if they’re correlated, up if either is important, down if either is sparse, and down if the model can allocate some parameters to denoising the interference. Importantly, for the ones to do with correlation, interference between correlated features will be way worse, so the model wants to find ways to minimise the max interference, and is happy to tolerate a lot of interference between uncorrelated features. Which means the optimal packing probably involves tegum products, because it’s a nice hack to efficiently get lots of the interference terms to zero.
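None of this is from the post, but just to make the qualitative shape of that claim concrete, a toy pairwise interference cost could look something like the sketch below (the functional form, the co-occurrence term, and the 0.1 denoising discount are all made up for illustration):

```python
def pairwise_interference_cost(dot, importance_i, importance_j,
                               freq_i, freq_j, correlation_ij, can_denoise):
    """Toy, made-up cost of the interference between features i and j.

    Mirrors the qualitative claims above: cost rises if the features co-occur more
    (correlation), rises with either feature's importance, falls if either feature
    fires rarely (sparsity), and falls if downstream parameters can denoise it.
    """
    co_fire = freq_i * freq_j + correlation_ij      # crude stand-in for co-occurrence probability
    denoise_discount = 0.1 if can_denoise else 1.0  # arbitrary discount for a denoising circuit
    return importance_i * importance_j * co_fire * denoise_discount * dot ** 2
```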
Probably my biggest remaining confusion is why tegum products are the best way to get a lot of interference terms to zero, rather than just some clever packing smeared across all dimensions.
That’s good to hear! And I agree with your new intuition.
I think if you want interference terms to actually be zero you have to end up with tegum products, because that means you want orthogonal vectors and that implies disjoint subspaces. Right?
I don’t think so? If you have eg 8 vectors arranged evenly in a 2D plane (so at 45 degrees to each other) there’s a lot of orthogonality, but no tegum product. I think the key weirdness of a tegum product is that it’s a partition, where every pair in different bits of the partition is orthogonal. I could totally imagine that eg the best way to fit 2n vectors in n-dimensional space is two sets of n orthogonal vectors, but at some arbitrary angle to each other.
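For what it's worth, a quick sanity check of the counting in that 8-vectors-in-2D example (just numpy, nothing model-specific):

```python
import numpy as np

# 8 unit vectors spread evenly around the plane, i.e. 45 degrees apart.
angles = np.arange(8) * np.pi / 4
vecs = np.stack([np.cos(angles), np.sin(angles)], axis=1)

dots = vecs @ vecs.T
i, j = np.triu_indices(8, k=1)                      # the 28 unordered pairs
orthogonal = np.isclose(dots[i, j], 0.0, atol=1e-9)
print(orthogonal.sum(), "of", len(i), "pairs are orthogonal")   # 8 of 28
```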
I can believe that tegum products are the right way to maximise the number of orthogonal pairs, though that still feels a bit weird to me. (technically, I think that the optimal way to fit kn vectors in R^n is to have n orthogonal directions and k vectors along each direction, maybe with different magnitudes—which is a tegum product. It forming 2D-3D subspaces feels odd though).
Oh yes you’re totally right.
I think partitions can get you more orthogonality than your specific example of overlapping orthogonal sets. Take n vectors and pack them into d dimensions in two ways:
A tegum product with k subspaces, giving (n/k) vectors per subspace and n^2*(1-1/k) orthogonal pairs.
(n/d) sets of d mutually orthogonal vectors, each set overlapping with the others, giving roughly n*d orthogonal pairs.
If d < n*(1-1/k) the tegum product buys you more orthogonal pairs. If n > d then picking large k (so low-dimensional subspaces) makes the tegum product preferred. (A quick numerical check of these counts is sketched below.)
This doesn’t mean there isn’t some other arrangement that does better though...
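Here's a quick numerical check of those two counts, building both arrangements explicitly with toy sizes (n, d, k chosen arbitrarily so n is divisible by both d and k; the vectors within each tegum subspace are just random, since only the cross-subspace zeros matter for the count):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 120, 12, 4   # toy sizes

def count_orthogonal_ordered_pairs(vecs, tol=1e-9):
    dots = vecs @ vecs.T
    off = dots[~np.eye(len(vecs), dtype=bool)]
    return int(np.sum(np.abs(off) < tol))

# Tegum-style packing: k disjoint (d/k)-dim subspaces with n/k random vectors in each,
# so every cross-subspace dot product is exactly zero.
tegum = np.zeros((n, d))
for b in range(k):
    block = rng.normal(size=(n // k, d // k))
    block /= np.linalg.norm(block, axis=1, keepdims=True)
    tegum[b * (n // k):(b + 1) * (n // k), b * (d // k):(b + 1) * (d // k)] = block

# Overlapping orthogonal sets: n/d random orthonormal bases of the full d-dim space,
# so the only orthogonal pairs are the ones within each basis.
bases = [np.linalg.qr(rng.normal(size=(d, d)))[0] for _ in range(n // d)]
overlapping = np.concatenate(bases, axis=0)

print("tegum:      ", count_orthogonal_ordered_pairs(tegum), "vs predicted", n * n * (1 - 1 / k))
print("overlapping:", count_orthogonal_ordered_pairs(overlapping), "vs predicted", n * (d - 1))
```

With these sizes the tegum packing gets 10800 orthogonal (ordered) pairs against 1320 for the overlapping bases, matching n^2*(1-1/k) and n*(d-1) respectively.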
Yeah, agreed that’s not an optimal arrangement; it was just a proof of concept for ‘non-tegum things can get a lot of orthogonality’.