Interpreting a dimensionality reduction of a collection of matrices as two positive semidefinite block diagonal matrices
Here are some empirical observations that I made from August 14, 2023 to August 19, 2023 that illustrate the interpretability of my own matrix dimensionality reduction algorithm. The phenomena described here do not occur on all inputs (and sometimes they occur only partially), and it would be nice if there were a more complete mathematical theory, with proofs, that explains these empirical phenomena.
Given a (possibly directed and possibly weighted) graph with n nodes represented as a collection of n×n-matrices A_1,…,A_r, we will observe that a dimensionality reduction (X_1,…,X_r) of (A_1,…,A_r), where each X_i is a d×d-matrix (I call this dimensionality reduction an LSRDR), is in many cases the optimal solution to a combinatorial problem for the graph. In this case, we have a complete interpretation of what the dimensionality reduction algorithm is doing.
For this post, let K denote either the field of real numbers or the field of complex numbers (everything also works well when K is the division ring of quaternions).
Notation: ρ(A) = lim_{n→∞} ‖A^n‖^{1/n} = max{|λ| : λ is an eigenvalue of A} is the spectral radius of the matrix A. A^T denotes the transpose of A, while A^* denotes the adjoint of A. We say that a tuple of matrices (A_1,…,A_r) is jointly similar to (B_1,…,B_r) if there is an invertible C with B_j = C A_j C^{-1} for 1 ≤ j ≤ r. A⊗B denotes the tensor product of A with B.
Let 1 ≤ d < n. Suppose that A_1,…,A_r are n×n-matrices with entries in K. Then we say that a collection (X_1,…,X_r) of d×d-matrices with entries in K is an L_{2,d}-spectral radius dimensionality reduction (abbreviated LSRDR) of A_1,…,A_r if the following quantity is locally maximized:
ρ(A_1⊗(X_1^*)^T + ⋯ + A_r⊗(X_r^*)^T) / ρ(X_1⊗(X_1^*)^T + ⋯ + X_r⊗(X_r^*)^T)^{1/2}.
LSRDRs may be computed using a variation of gradient ascent.
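As a concrete illustration, here is a minimal numpy sketch that evaluates the fitness ratio above and locally maximizes it; random hill climbing is used here only as a simple stand-in for the gradient-ascent variant mentioned above, and all function names are hypothetical. Note that (X^*)^T is the entrywise conjugate of X, so the tensor factors are simply the conjugated X_j.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return np.abs(np.linalg.eigvals(M)).max()

def lsrdr_fitness(As, Xs):
    """rho(sum_j A_j (x) conj(X_j)) / rho(sum_j X_j (x) conj(X_j))^(1/2)."""
    num = sum(np.kron(A, X.conj()) for A, X in zip(As, Xs))
    den = sum(np.kron(X, X.conj()) for X in Xs)
    return spectral_radius(num) / spectral_radius(den) ** 0.5

def lsrdr_hill_climb(As, d, steps=5000, step_size=0.05, seed=0):
    """Crude local search over (X_1,...,X_r); a stand-in for gradient ascent."""
    rng = np.random.default_rng(seed)
    Xs = [rng.standard_normal((d, d)) for _ in As]
    best = lsrdr_fitness(As, Xs)
    for _ in range(steps):
        Ys = [X + step_size * rng.standard_normal((d, d)) for X in Xs]
        f = lsrdr_fitness(As, Ys)
        if f > best:
            Xs, best = Ys, f
    return Xs, best

# Example: reduce three random 6x6 matrices to 2x2.
rng = np.random.default_rng(0)
As = [rng.standard_normal((6, 6)) for _ in range(3)]
Xs, fitness = lsrdr_hill_climb(As, d=2)
```

For real inputs the conjugation is a no-op; in practice one would replace the random search with gradient ascent on the (logarithm of the) fitness.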
If (X_1,…,X_r) is an LSRDR of (A_1,…,A_r), then one will typically be able to find matrices R, S, P and some λ ∈ K where X_j = R A_j S for 1 ≤ j ≤ r and where S R = λP and P^2 = P. We shall call P an L_{2,d}-SRDR projection operator of (A_1,…,A_r). (X_1 ⊕ 0_{n−d},…,X_r ⊕ 0_{n−d}) is jointly similar to (λ P A_1 P,…,λ P A_r P), where 0_{n−d} is the (n−d)×(n−d) zero matrix. The L_{2,d}-SRDR projection operator P is typically unique in the sense that if we run the gradient ascent again to obtain another L_{2,d}-SRDR projection operator, then we will obtain the same L_{2,d}-SRDR projection operator that we originally had. If B_1,…,B_r are n×n-matrices, then let Φ(B_1,…,B_r) : M_n(K) → M_n(K) denote the completely positive linear operator defined by Φ(B_1,…,B_r)(X) = B_1 X B_1^* + ⋯ + B_r X B_r^*. Let H denote the dominant eigenvector of Φ(B_1,…,B_r) with Tr(H) = 1, and let G denote the dominant eigenvector of Φ(B_1,…,B_r)^* with Tr(G) = 1. Then the eigenvectors H, G will typically be positive semidefinite matrices.
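The operator Φ and the eigenvectors H and G can be computed directly. Below is a minimal sketch (with hypothetical helper names) that obtains H and G by power iteration on matrices, starting from the identity, and checks that they come out positive semidefinite; this assumes the dominant eigenvalue of Φ is simple so that the iteration converges.

```python
import numpy as np

def phi(Bs, X):
    """The completely positive operator Phi(B_1,...,B_r)(X) = sum_j B_j X B_j^*."""
    return sum(B @ X @ B.conj().T for B in Bs)

def phi_adjoint(Bs, X):
    """Adjoint of Phi with respect to the trace inner product: X -> sum_j B_j^* X B_j."""
    return sum(B.conj().T @ X @ B for B in Bs)

def dominant_eigenmatrix(op, Bs, n, iters=2000):
    """Power iteration on matrices; returns a trace-one dominant eigenvector of op."""
    H = np.eye(n) / n
    for _ in range(iters):
        H = op(Bs, H)
        H = H / np.trace(H)
    return H

# Example: H and G for a random tuple of matrices; both are typically positive semidefinite.
rng = np.random.default_rng(0)
Bs = [rng.standard_normal((4, 4)) for _ in range(3)]
H = dominant_eigenmatrix(phi, Bs, 4)
G = dominant_eigenmatrix(phi_adjoint, Bs, 4)
print(np.linalg.eigvalsh((H + H.conj().T) / 2))  # eigenvalues should be >= 0 up to numerical error
print(np.linalg.eigvalsh((G + G.conj().T) / 2))
```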
Suppose now that V_1,…,V_q are finite dimensional K-inner product spaces. Let V = V_1 ⊕ ⋯ ⊕ V_q. Let A_1,…,A_r : V → V be linear transformations. Suppose that for 1 ≤ j ≤ r, there are u_j, v_j ∈ {1,…,q} where if x ∈ V_u, y ∈ V_v and ⟨A_j x, y⟩ ≠ 0, then u = u_j and v = v_j. Suppose that P : V → V is an L_{2,d}-SRDR projection operator for (A_1,…,A_r). Then we will typically be able to find linear operators P_j : V_j → V_j for 1 ≤ j ≤ q where P = P_1 ⊕ ⋯ ⊕ P_q. Since P = P^2, we observe that P_j = P_j^2 for all j. As a consequence, there will be positive semidefinite operators G_j, H_j : V_j → V_j for 1 ≤ j ≤ q where G = G_1 ⊕ ⋯ ⊕ G_q and H = H_1 ⊕ ⋯ ⊕ H_q.
Application: Weighted graph/digraph dominant clustering.
Let V = {1,…,n} be a vertex set. Suppose that 1 ≤ d < n. Let f : V×V → K be a function. For example, the function f could denote a weight matrix of a graph or neural network. For each i, j, let A_{i,j} be the n×n-matrix whose (i,j)-th entry is f(i,j) and whose other entries are all zero. Then we will typically be able to find an L_{2,d}-SRDR projection operator P of (A_{i,j})_{i,j} along with a set T ⊆ {1,…,n} where P is the diagonal matrix whose i-th diagonal entry is 1 for i ∈ T and 0 otherwise. The set T represents a dominant cluster of size d in the set V. Let A = (a_{i,j})_{i,j} be the n×n-matrix where a_{i,j} = |f(i,j)|^2 for all i, j. If S ⊆ V, then set
A|_S = (a_{i,j}·χ_S(i)·χ_S(j))_{1≤i≤n, 1≤j≤n}, where χ_S(i) = 1 whenever i ∈ S and χ_S(i) = 0 otherwise. In other words, if A|_S = (b_{i,j})_{i,j}, then b_{i,j} = a_{i,j} whenever i, j ∈ S and b_{i,j} = 0 otherwise. Then the dominant cluster T will typically be the subset of V of size d where the spectral radius ρ(A|_T) is maximized.
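To make the combinatorial side concrete, the following sketch brute-forces the size-d subset T that maximizes ρ(A|_T) for a small weight function f; in the observations above, the L_{2,d}-SRDR projection operator of (A_{i,j})_{i,j} typically picks out exactly this subset. The helper names are hypothetical.

```python
import itertools
import numpy as np

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

def restrict(A, T):
    """A|_T: zero out every row and column whose index lies outside T."""
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[list(T)] = True
    return A * np.outer(mask, mask)

def dominant_cluster_brute_force(A, d):
    """The size-d subset T maximizing rho(A|_T), found by exhaustive search."""
    n = A.shape[0]
    return max(itertools.combinations(range(n), d),
               key=lambda T: spectral_radius(restrict(A, T)))

# Example with a random weight function f on 8 vertices: a_{i,j} = |f(i,j)|^2.
rng = np.random.default_rng(1)
F = rng.standard_normal((8, 8))
A = np.abs(F) ** 2
print(dominant_cluster_brute_force(A, d=3))
```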
We say that a square matrix C with non-negative real entries is a primitive matrix if there is some n where each entry of C^n is positive. Suppose now that C is the direct sum of a primitive matrix and a zero matrix. Then the spectral radius ρ(C) is the dominant eigenvalue of C, and the root ρ(C) of the characteristic polynomial of C has multiplicity 1. Furthermore, there is a vector v with non-negative real entries with C v = ρ(C) v. We shall call v the Perron-Frobenius eigenvector of C.
For T ⊆ {1,…,n}, let v_T be the dominant eigenvector of A|_T where the sum of the entries of v_T is 1, and let w_T be the dominant eigenvector of (A|_T)^* where the sum of the entries of w_T is 1. If u is a vector, then let diag(u) denote the diagonal matrix whose diagonal entries are the entries of u. Then H = diag(v_T) and G = diag(w_T).
The problem of maximizing ρ(A|_T) is a natural problem that is meaningful for adjacency matrices of (weighted) graphs/digraphs and for Markov chains. If f is the adjacency matrix of a graph or digraph G, then the value ρ(A|_T) is a measure of how internally connected the induced subgraph G[T] is, and if the graph is undirected and simple, then ρ(A|_T) is maximized when T is a clique (recall that a subset of the vertices of a simple undirected graph is a clique if all pairs of distinct nodes in it are connected to each other). More specifically, the number of walks with m edges in the induced subgraph G[T] will be about ρ(A|_T)^m. To make this statement precise, if there are t_m walks of length m in the induced subgraph G[T], then lim_{m→∞} t_m^{1/m} = ρ(A|_T). Therefore, the set T maximizes the number of walks of length m in the induced subgraph G[T] for large m.
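Here is a minimal check of the walk-counting statement on a small hypothetical example: the number of length-m walks in G[T] is the sum of the entries of the m-th power of the adjacency matrix of G[T], and its m-th root approaches ρ(A|_T) (which equals the spectral radius of the submatrix indexed by T).

```python
import numpy as np

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

# Adjacency matrix of a small undirected graph and a subset T that forms a triangle.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
T = [0, 1, 2]
A_T = A[np.ix_(T, T)]   # adjacency matrix of the induced subgraph G[T]

m = 30
t_m = np.sum(np.linalg.matrix_power(A_T, m))   # number of walks with m edges in G[T]
print(t_m ** (1 / m), spectral_radius(A_T))    # both approach rho(A|_T) = 2 for a triangle
```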
The problem of determining whether a graph has a clique of size d is an NP-complete problem, so we should not expect there to be an algorithm that always solves this problem efficiently. On the other hand, for many NP-complete problems, there are plenty of heuristic algorithms that give decent solutions in most cases. The use of LSRDRs to find the clique T is another kind of heuristic algorithm that can be used to find the largest clique in a graph and to solve more general problems. But the NP-completeness of the clique problem also indicates that LSRDRs most likely are unable to produce cliques in exceptionally difficult graphs.
If A is the transition matrix of an irreducible and aperiodic Markov chain (X_m)_{m≥0}, then the probability that X_m,…,X_{m+k} ∈ T will be approximately ρ(A|_T)^k. More precisely, lim_{k→∞} P(X_0,…,X_k ∈ T)^{1/k} = ρ(A|_T). In this case, the set T maximizes the probability P(X_0,…,X_k ∈ T) for large values of k.
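A minimal sketch of the Markov chain statement, with a hypothetical 3-state transition matrix: the probability of staying inside T for k consecutive steps is π_0|_T (A restricted to T)^k 1, and its k-th root approaches ρ(A|_T).

```python
import numpy as np

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

# A small irreducible, aperiodic transition matrix (rows sum to 1) and a set T.
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
T = [0, 1]
A_T = A[np.ix_(T, T)]               # sub-stochastic matrix of transitions staying inside T
pi0 = np.full(len(A), 1 / len(A))   # uniform initial distribution

k = 200
# P(X_0, ..., X_k all lie in T) = pi0 restricted to T, times (A_T)^k, times the all-ones vector.
p = pi0[T] @ np.linalg.matrix_power(A_T, k) @ np.ones(len(T))
print(p ** (1 / k), spectral_radius(A_T))   # both should be close to rho(A|_T)
```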
Maximizing the total weight of edges of an induced subgraph:
If Y is a matrix, then let ∑Y denote the sum of all the entries in Y.
Proposition: Suppose that M_d is the d×d-matrix where each entry of M_d is 1/d. Let X be a real d×d-matrix. Then lim_{δ→0} (ρ(M_d + δX) − ρ(M_d))/δ = (∑X)/d.
Let M_{n,d} be the n×n-matrix where each entry of M_{n,d} is 1/d. Let B be a real n×n-matrix. For simplicity, assume that the value ∑B|_T is distinct for each T ⊆ {1,…,n} with |T| = d. Let A = M_{n,d} + δB. Then for sufficiently small δ, the spectral radius ρ(A|_T) is maximized (subject to the condition that |T| = d) precisely when the sum ∑B|_T is maximized. LSRDRs may therefore be used to find the subset T ⊆ {1,…,n} with |T| = d that maximizes ∑B|_T.
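The following sketch checks the proposition by a finite difference and then verifies, by brute force on a small random B, that for small δ the subset T maximizing ρ((M_{n,d} + δB)|_T) coincides with the subset maximizing ∑B|_T; the sizes and constants are arbitrary choices made for illustration.

```python
import itertools
import numpy as np

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

def restrict(A, T):
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[list(T)] = True
    return A * np.outer(mask, mask)

rng = np.random.default_rng(2)
n, d, delta = 7, 3, 1e-4

# Finite-difference check of the proposition: (rho(M_d + delta X) - rho(M_d)) / delta ~ sum(X)/d.
M_d = np.full((d, d), 1 / d)
X = rng.standard_normal((d, d))
print((spectral_radius(M_d + delta * X) - spectral_radius(M_d)) / delta, X.sum() / d)

# For small delta, maximizing rho((M_{n,d} + delta B)|_T) over |T| = d
# agrees with maximizing the total weight sum(B|_T).
M_nd = np.full((n, n), 1 / d)
B = rng.standard_normal((n, n))
subsets = list(itertools.combinations(range(n), d))
T_rho = max(subsets, key=lambda T: spectral_radius(restrict(M_nd + delta * B, T)))
T_sum = max(subsets, key=lambda T: restrict(B, T).sum())
print(T_rho == T_sum)
```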
Why do LSRDRs behave this way?
Suppose that (A_1,…,A_r) are complex matrices that generate the algebra M_n(C). Then there is some invertible B and constant λ where the operator Φ(λ B A_1 B^{-1},…,λ B A_r B^{-1}) is a quantum channel (by a quantum channel, I mean a completely positive trace preserving superoperator), so LSRDRs should be considered to be dimensionality reductions of quantum channels E : M_n(C) → M_n(C). Primitive matrices can be associated with stochastic matrices in the same way: if A is a primitive matrix, then there is a diagonal matrix D and a constant λ where λ D^{-1} A D is a stochastic matrix. One should consider the LSRDR of (A_{i,j})_{i,j} to be a dimensionality reduction for Markov chains. The most sensible way to take a dimensionality reduction of an n-state Markov chain is to select d states so that those d states form a subset that is in some sense optimal. And, for LSRDRs, the best choice of a d-element subset T of {1,…,n} is the one that maximizes ρ(A|_T).
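One way to obtain such a B and λ (a sketch, assuming the dominant eigenvector Q of Φ(A_1,…,A_r)^* is positive definite, which is the typical case when the A_j generate M_n(C)) is to take any B with B^*B = Q and λ = ρ(Φ)^{-1/2}; the Kraus operators λ B A_j B^{-1} then satisfy the trace-preservation condition. The helper names below are hypothetical, and the final line checks trace preservation numerically.

```python
import numpy as np

def phi_adjoint(As, X):
    """Adjoint of Phi(A_1,...,A_r): X -> sum_j A_j^* X A_j."""
    return sum(A.conj().T @ X @ A for A in As)

def channel_normalization(As, iters=5000):
    """Return B, lam such that the Kraus operators lam * B A_j B^{-1} satisfy
    sum_j C_j^* C_j = I, assuming the dominant eigenvector Q of Phi^* is positive definite."""
    n = As[0].shape[0]
    Q = np.eye(n)
    for _ in range(iters):               # power iteration for the dominant eigenvector of Phi^*
        Q = phi_adjoint(As, Q)
        Q = Q / np.trace(Q)
    rho = np.trace(phi_adjoint(As, Q)).real   # dominant eigenvalue (Q has trace one here)
    B = np.linalg.cholesky(Q).conj().T        # any B with B^* B = Q works
    lam = rho ** -0.5
    return B, lam

# Check trace preservation on a random tuple (hypothetical example).
rng = np.random.default_rng(3)
As = [rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4)) for _ in range(3)]
B, lam = channel_normalization(As)
Binv = np.linalg.inv(B)
Cs = [lam * B @ A @ Binv for A in As]
print(np.allclose(sum(C.conj().T @ C for C in Cs), np.eye(4)))   # typically True
```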
Conclusions:
The LSRDRs of (A_{i,j})_{i,j} have a notable combination of interpretability features: these LSRDRs tend to converge to the same local maximum (up to joint similarity and a constant factor) regardless of the initialization, and we are able to give an explicit expression for this local maximum. We also have a duality between the problem of computing the LSRDR of (A_{i,j})_{i,j} and the problem of maximizing ρ(A|_T) where |T| = d. With this duality, the LSRDR of (A_{i,j})_{i,j} is fully interpretable as a solution to a combinatorial optimization problem.
I hope to make more posts about some of my highly interpretable machine learning algorithms, together with some of the tools that we can use to interpret AI.
Edited: 1/10/2024