The singular value decomposition is great, and it even works well for complex and quaternionic matrices (we can even generalize decompositions like the spectral, polar, and singular value decompositions to bounded linear operators between infinite dimensional Hilbert spaces), but to better appreciate the singular value decomposition, we ought to analyze its deficiencies as well. I am currently researching other linear dimensionality reduction techniques (namely LSRDRs) that work well in cases where the singular value decomposition is not the best choice of linear dimensionality reduction. These cases include the following:
While the SVD approximates a matrix with a low-rank matrix, it does not generalize that well to higher-order SVDs that decompose tensors in V_1⊗V_2⊗⋯⊗V_n, where V_1,…,V_n are vector spaces.
The SVD applies to linear mappings between inner product spaces, but it does not take into account any additional structure that the linear mappings or inner product spaces may have. For example, if we had a tuple of vectors (v_1,…,v_r), then we may want a linear dimensionality reduction that does not just treat (v_1,…,v_r) as a matrix. For instance, it is more meaningful to consider a weight matrix in a neural network as a tuple of row vectors or a tuple of column vectors than as a matrix without additional structure.
If we apply principal component analysis to a collection (v_1,…,v_r) of vectors (with mean 0 for simplicity), then the k-dimensional subspace M that best approximates (v_1,…,v_r) may fail to correspond to any cluster of dimensions. For example, suppose that X_1,…,X_s,Y_1,…,Y_s are independent normally distributed random variables, each with mean 0, where each X_j has covariance matrix I_n⊕0.1·I_n while each Y_j has covariance matrix 0.1·I_n⊕I_n. If we take a random sample (x_1,…,x_s,y_1,…,y_s) from X_1,…,X_s,Y_1,…,Y_s and then perform a principal component analysis on (x_1,…,x_s,y_1,…,y_s) to find an n-dimensional subspace of R^n⊕R^n, then the principal component analysis will not tell us anything meaningful. We would ideally want something similar to principal component analysis that instead returns subspaces near R^n⊕{0} or {0}⊕R^n (see the sketch after this list). Principal component analysis returns the k-dimensional subspace that captures the most of the data's magnitude, but it does not care whether the dimensions it returns form a cluster in any way.
Every real, complex, or quaternionic matrix has an SVD, and the SVD is unique except when there are repeated singular values. While mathematicians tend to like it when something exists and is unique, and computer scientists may find the existence and uniqueness of the singular value decomposition useful, the existence and uniqueness of the SVD does have its weaknesses (existence and uniqueness theorems suggest a sort of simplicity that is not the best indicator of good mathematics; good mathematics is often more complicated than what you would get from a simple existence and uniqueness result). One should think of the SVD as a computer program (or programming language, or piece of code) that always returns an output, regardless of whether the output makes sense, without ever producing an error message. Software that never produces an error message makes it more difficult to diagnose a problem or to determine whether one is using the correct tool in the first place, and this applies to the singular value decomposition as well. The poor behavior of an algorithm can itself provide useful information. For example, suppose that one is analyzing a block cipher round function E using an algorithm L. If L produces errors for complicated block cipher round functions but not for simple ones, then the presence of one or more errors is evidence that the block cipher round function E is secure.
If A is a real matrix, but we take the complex or quaternionic SVD of A to get a factorization A=UDV^*, then the matrices U,V can be taken to be real orthogonal matrices rather than properly complex or quaternionic ones. In other words, the SVD of a matrix is always well-behaved, which is again a problem: this well-behavedness does not tell us whether the singular value decomposition is actually useful for the circumstances we are using it for, while the poor behavior of a process may provide useful information.
The SVD is not exactly new or cutting edge, so it will give one only a limited amount of information about matrices or other objects.
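Here is a minimal numpy sketch of the principal component analysis example above (the values of n and s are arbitrary stand-ins). Since the pooled population covariance is isotropic, the top principal directions are driven by sampling noise and typically spread their mass across both blocks instead of clustering near R^n⊕{0} or {0}⊕R^n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 10, 500

cov_x = np.diag([1.0] * n + [0.1] * n)   # covariance of the X_j samples
cov_y = np.diag([0.1] * n + [1.0] * n)   # covariance of the Y_j samples
xs = rng.multivariate_normal(np.zeros(2 * n), cov_x, size=s)
ys = rng.multivariate_normal(np.zeros(2 * n), cov_y, size=s)
data = np.vstack([xs, ys])
data -= data.mean(axis=0)

# PCA via the SVD of the centered data matrix.
_, _, vt = np.linalg.svd(data, full_matrices=False)
top = vt[:n]                              # top n principal directions

# Fraction of each principal direction's mass lying in the first block R^n (+) {0}.
mass_in_first_block = (top[:, :n] ** 2).sum(axis=1)
print(np.round(mass_in_first_block, 2))   # typically far from 0 and 1: no clustering
```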
Let K denote either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Suppose that A_1,…,A_r are n×n-matrices with entries in K. If X_1,…,X_r are d×d-matrices, then define a superoperator Γ(A_1,…,A_r;X_1,…,X_r):M_{n,d}(K)→M_{n,d}(K) by letting Γ(A_1,…,A_r;X_1,…,X_r)(X)=A_1XX_1^*+⋯+A_rXX_r^* whenever X∈M_{n,d}(K). Define a partial (but nearly total) function F_{A_1,…,A_r;K}:M_d(K)^r→[0,∞) by letting
F_{A_1,…,A_r;K}(X_1,…,X_r) = ρ(Γ(A_1,…,A_r;X_1,…,X_r)) / ρ(Φ(X_1,…,X_r))^{1/2}. Here, ρ denotes the spectral radius of a linear operator, and Φ(X_1,…,X_r) denotes the superoperator Γ(X_1,…,X_r;X_1,…,X_r). We say that (X_1,…,X_r) is an L_{2,d}-spectral radius dimensionality reduction (LSRDR) of (A_1,…,A_r) of type K if the quantity F_{A_1,…,A_r;K}(X_1,…,X_r) is locally maximized.
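To make the definition concrete, here is a minimal numpy sketch (for K = R or C only; the function names are my own) that evaluates F_{A_1,…,A_r;K} by representing Γ and Φ as ordinary matrices via Kronecker products and taking spectral radii:

```python
import numpy as np

def superoperator_matrix(As, Xs):
    """Matrix of the map X |-> sum_j A_j X X_j^* acting on vec'd matrices.

    Uses vec(A Y B) = (B^T kron A) vec(Y) with column-major (Fortran) vec.
    """
    n, d = As[0].shape[0], Xs[0].shape[0]
    M = np.zeros((n * d, n * d), dtype=complex)
    for A, X in zip(As, Xs):
        B = X.conj().T                    # right factor X_j^*
        M += np.kron(B.T, A)              # (B^T kron A)
    return M

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def lsrdr_fitness(As, Xs):
    """F_{A_1,...,A_r;K}(X_1,...,X_r) = rho(Gamma(A;X)) / rho(Phi(X))^(1/2)."""
    gamma = superoperator_matrix(As, Xs)
    phi = superoperator_matrix(Xs, Xs)    # Phi(X_1,...,X_r) = Gamma(X;X)
    return spectral_radius(gamma) / spectral_radius(phi) ** 0.5
```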
One can compute LSRDRs using a flavor of gradient ascent. Don't worry: taking an approximate gradient of F_{A_1,…,A_r;K} is less computationally costly than it sounds, and the gradient ascent should converge to an LSRDR (X_1,…,X_r). If the gradient ascent process fails to converge quickly to an LSRDR (X_1,…,X_r), then LSRDRs may not be the best tool to use.
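The comment does not pin down a particular optimizer, so the following is only a naive sketch for real matrices: plain gradient ascent on lsrdr_fitness from the sketch above, with the gradient estimated by finite differences (a real implementation would use automatic differentiation or eigenvector-based gradient formulas):

```python
import numpy as np

def lsrdr_gradient_ascent(As, d, steps=500, lr=1e-2, eps=1e-6, seed=0):
    """Crude finite-difference gradient ascent toward a local maximum of F."""
    rng = np.random.default_rng(seed)
    Xs = [rng.standard_normal((d, d)) for _ in As]      # random initialization
    for _ in range(steps):
        base = lsrdr_fitness(As, Xs)                    # from the sketch above
        grads = []
        for j in range(len(Xs)):
            g = np.zeros((d, d))
            for a in range(d):
                for b in range(d):
                    Xp = [Y.copy() for Y in Xs]
                    Xp[j][a, b] += eps
                    g[a, b] = (lsrdr_fitness(As, Xp) - base) / eps
            grads.append(g)
        Xs = [X + lr * g for X, g in zip(Xs, grads)]
    return Xs
```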
We say that (X_1,…,X_r),(Y_1,…,Y_r) are projectively similar and write (X_1,…,X_r)≃_K(Y_1,…,Y_r) if there is some α∈Z(K) (where Z(K) denotes the center of K) and some invertible matrix R such that X_j=αRY_jR^{-1} for 1≤j≤r. Let [X_1,…,X_r]_K denote the equivalence class containing (X_1,…,X_r).
The equivalence class [X_1,…,X_r]_K of an LSRDR of type K of (A_1,…,A_r) is often unique. At the very least, one should only be able to find a few equivalence classes [X_1,…,X_r]_K of LSRDRs of type K of (A_1,…,A_r), and the equivalence class [X_1,…,X_r]_K of LSRDRs with the highest fitness should also be the easiest to find. But if the equivalence class [X_1,…,X_r]_K is far from unique, then this should be an indicator that taking an LSRDR may not be the best tool for analyzing (A_1,…,A_r), so one should try something else in that case.
If A_1,…,A_r are all real matrices but K=C, then the equivalence class [X_1,…,X_r]_K of the LSRDR should contain a tuple (Y_1,…,Y_r) where each Y_i is a real matrix. A quick way to test whether one should be able to find such a tuple (Y_1,…,Y_r) given an LSRDR (X_1,…,X_r) is to compute the quotients Tr(X_i)/Tr(X_j). If each Tr(X_i)/Tr(X_j) is a real number (up to a rounding error), then the LSRDR is well-behaved and perhaps an appropriate tool to use; otherwise the LSRDR may not be the best tool to use.
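As a small illustration (assuming the quotient reading of the test above; the function name is my own), one can check how far the trace ratios are from being real:

```python
import numpy as np

def trace_ratios_nearly_real(Xs, tol=1e-6):
    """Check whether Tr(X_i)/Tr(X_j) is (approximately) real for all i, j."""
    traces = [np.trace(X) for X in Xs]
    ratios = [ti / tj for ti in traces for tj in traces if abs(tj) > tol]
    return all(abs(r.imag) <= tol * max(1.0, abs(r)) for r in ratios)
```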
If we find our LSRDR (X_1,…,X_r) of type K of (A_1,…,A_r), then if everything works out well, there should be matrices R,S where X_j=RA_jS for 1≤j≤r, where RS=λ·1_d, and where SR=λ·P for some (not necessarily orthogonal) projection matrix P and constant λ∈Z(K). If λ=1, then we say that R,S is constant factor normalized; in that case RS=1_d and SR=P, so let us assume that R,S is constant factor normalized to make everything simpler. Let U_R be the dominant eigenvector of Γ(A_1,…,A_r;X_1,…,X_r), and let U_L be the dominant eigenvector of its adjoint Γ(A_1,…,A_r;X_1,…,X_r)^* (since Γ acts on M_{n,d}(K), these eigenvectors are n×d matrices). Then there are positive semidefinite matrices G,H and non-zero constants μ_G,μ_H∈Z(K) where U_RS^*=μ_H·H and U_LR=μ_G·G. The projection matrix P can be recovered from the positive semidefinite matrices G,H since Im(H)=Im(P) and Im(G)=ker(P)^⊥, and the positive semidefinite matrices G,H (up to a constant real factor) should be uniquely determined. The positive semidefinite matrices G,H should be considered the dominant clusters of dimensions for (A_1,…,A_r).
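A sketch of this recovery step, reusing superoperator_matrix from the earlier sketch and assuming R and S have already been found somehow (finding them is not covered here; all names are my own). The dominant eigenvectors of the nd×nd matrix representation and of its conjugate transpose are reshaped back into n×d matrices, and G, H come out only up to the constants μ_G, μ_H:

```python
import numpy as np

def dominant_eigvec_as_matrix(M, n, d):
    """Eigenvector of M for the eigenvalue of largest magnitude, un-vec'd to n x d."""
    w, v = np.linalg.eig(M)
    u = v[:, np.argmax(np.abs(w))]
    return u.reshape((n, d), order="F")    # undo column-major (Fortran) vec

def dominant_clusters(As, Xs, R, S):
    """Candidate G, H (up to the constants mu_G, mu_H) from U_L R and U_R S^*."""
    n, d = As[0].shape[0], Xs[0].shape[0]
    gamma = superoperator_matrix(As, Xs)   # from the earlier sketch
    U_R = dominant_eigvec_as_matrix(gamma, n, d)
    U_L = dominant_eigvec_as_matrix(gamma.conj().T, n, d)
    H = U_R @ S.conj().T
    G = U_L @ R
    return G, H
```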
Order 2 tensors: Suppose that v_1,…,v_r∈V for some finite dimensional real inner product space V. Then set A_j=v_jv_j^* for 1≤j≤r. In this case G=H, so the positive semidefinite matrix G is our desired dimensionality reduction of v_1,…,v_r. For example, if M is a weight matrix in a neural network, then we can take v_1,…,v_r to be the columns of M, or we can take v_1,…,v_r to be the transposes of the rows of M. Since we apply activation functions before and after we apply M, it makes sense to separate M into rows and columns this way. And yes, I have performed computer experiments indicating that for A_j=v_jv_j^*, the matrices G,H do represent a cluster of dimensions (at least sometimes) rather than simply the top d dimensions. I have done the experiment where (v_1,…,v_r)=(x_1,…,x_s,y_1,…,y_s) as in the principal component analysis example above, and in this experiment the matrices G,H,P (up to a constant factor for G,H) are all approximately the projection matrix onto the subspace R^n⊕{0}.
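For the order 2 case, a minimal sketch (reusing the hypothetical lsrdr_fitness helper above; the weight matrix is a random stand-in) of how one might form the A_j from a weight matrix's columns and score a candidate tuple:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((20, 30))          # a stand-in for a weight matrix

# Order 2 construction: one rank-one matrix v_j v_j^T per column of M.
As = [np.outer(v, v) for v in M.T]

# Random d x d candidates X_j; in practice these would come from gradient ascent.
d = 5
Xs = [rng.standard_normal((d, d)) for _ in As]
print(lsrdr_fitness(As, Xs))               # fitness of this (non-optimized) candidate
```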
Order 3 tensors: Suppose that V,W are finite dimensional real or complex inner product spaces and A:V→V⊗W is a linear mapping. Observe that L(V,V⊗W) is canonically isomorphic to V⊗V⊗W. Now give W an orthonormal basis e_1,…,e_r, and set A_j=(1_V⊗e_j^*)A for 1≤j≤r. Then one can apply an LSRDR to A_1,…,A_r to obtain the positive semidefinite matrices G,H, and these matrices do not depend on the orthonormal basis e_1,…,e_r that we choose. For example, suppose that O_1,O_2 are open subsets of Euclidean spaces of possibly different dimensions and f:O_1→O_2 is a C^2-function with component functions f_1,…,f_r:O_1→R, so that f(x)=(f_1(x),…,f_r(x)) for each x∈O_1. Then let A_j=H(f_j)(x) for 1≤j≤r, where H(f_j) denotes the Hessian of f_j. The matrices G,H of an LSRDR of A_1,…,A_r then represent a cluster of dimensions in the tangent space at the point x.
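A sketch of the Hessian construction (the function f and the point below are made-up stand-ins, and the Hessians are estimated with central finite differences rather than computed exactly):

```python
import numpy as np

def hessian_fd(fj, x, eps=1e-4):
    """Central finite-difference Hessian of a scalar function fj at x."""
    m = len(x)
    Hj = np.zeros((m, m))
    for a in range(m):
        for b in range(m):
            xpp = x.copy(); xpp[a] += eps; xpp[b] += eps
            xpm = x.copy(); xpm[a] += eps; xpm[b] -= eps
            xmp = x.copy(); xmp[a] -= eps; xmp[b] += eps
            xmm = x.copy(); xmm[a] -= eps; xmm[b] -= eps
            Hj[a, b] = (fj(xpp) - fj(xpm) - fj(xmp) + fj(xmm)) / (4 * eps ** 2)
    return Hj

# Made-up example: f : R^4 -> R^3 with components f_1, f_2, f_3.
fs = [lambda x: np.sin(x[0]) * x[1],
      lambda x: x[2] ** 2 + x[0] * x[3],
      lambda x: np.exp(-x[1] * x[3])]
x0 = np.array([0.3, -1.0, 0.7, 0.2])
As = [hessian_fd(fj, x0) for fj in fs]     # the A_j = H(f_j)(x0) fed into the LSRDR
```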
Order 4 tensors: Given a vector space V, let L(V) denote the collection of linear maps from V to V. Let V be a finite dimensional complex inner product space. Then there are various ways to put V⊗V⊗V⊗V into a canonical one-to-one correspondence with L(L(V)). Furthermore, the Choi representation gives a one-to-one correspondence between the completely positive operators in L(L(V)) and the positive semidefinite operators in L(V⊗V). An operator E∈L(L(V)) is completely positive if and only if there are A_1,…,A_r∈L(V) where E(X)=A_1XA_1^*+⋯+A_rXA_r^* for all X∈L(V). Therefore, whenever E is completely positive, we can compute a complex LSRDR (X_1,…,X_r) of (A_1,…,A_r), and we should get matrices R,S,P,G,H, where G,H give us our desired dimensionality reduction. Of course, given an order 4 tensor, one has to ask whether it is appropriate to use LSRDRs for that tensor, and one should ask about the best way to produce an LSRDR from it.
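For completeness, here is a standard way (sketched in numpy; the function names and the example map are my own) to extract operators A_1,…,A_r from the Choi matrix of a completely positive map, which can then be fed into the LSRDR machinery:

```python
import numpy as np

def choi_matrix(E, n):
    """Choi matrix sum_{i,j} E_ij (x) E(E_ij) of a map E on n x n matrices."""
    C = np.zeros((n * n, n * n), dtype=complex)
    for i in range(n):
        for j in range(n):
            Eij = np.zeros((n, n), dtype=complex)
            Eij[i, j] = 1.0
            C += np.kron(Eij, E(Eij))
    return C

def kraus_operators(E, n, tol=1e-10):
    """Operators A_k with E(X) = sum_k A_k X A_k^*, valid when E is completely positive."""
    C = choi_matrix(E, n)
    w, v = np.linalg.eigh((C + C.conj().T) / 2)   # Hermitian part guards against round-off
    ops = []
    for lam, vec in zip(w, v.T):
        if lam > tol:                              # negative eigenvalues mean E is not CP
            ops.append(np.sqrt(lam) * vec.reshape(n, n).T)
    return ops

# Quick check on a made-up completely positive map.
B1 = np.array([[1.0, 0.5], [0.0, 1.0]])
B2 = np.array([[0.0, 1.0], [0.2, 0.0]])
E = lambda X: B1 @ X @ B1.conj().T + B2 @ X @ B2.conj().T
As = kraus_operators(E, 2)
X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.allclose(sum(A @ X @ A.conj().T for A in As), E(X)))  # True
```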
If this comment were not long enough already, I would give an explanation for why I believe LSRDRs often behave well, but this post is really about the SVD, so I will save my mathematics for another time.