Interpreting a dimensionality reduction of a collection of matrices as two positive semidefinite block diagonal matrices
Here are some empirical observations that I made from August 14, 2023 to August 19, 2023 that illustrate the interpretability of my own matrix dimensionality reduction algorithm. The phenomena described here do not occur on all inputs (and sometimes they occur only partially), and it would be nice if there were a more complete mathematical theory, with proofs, that explains these empirical phenomena.
Given a (possibly directed and possibly weighted) graph with n nodes represented as a collection of n×n-matrices A_1,…,A_r, we will observe that a dimensionality reduction (X_1,…,X_r) of (A_1,…,A_r), where each X_i is a d×d-matrix (I call this dimensionality reduction an LSRDR), is in many cases the optimal solution to a combinatorial problem for the graph. In this case, we have a complete interpretation of what the dimensionality reduction algorithm is doing.
For this post, let K denote either the field of real numbers or the field of complex numbers (everything also works well when K is the division ring of quaternions).
Notation: ρ(A) = lim_{n→∞} ‖A^n‖^{1/n} = max{|λ| : λ is an eigenvalue of A} is the spectral radius of the matrix A. A^T denotes the transpose of A, while A^* denotes the adjoint of A. We say that a tuple of matrices (A_1,…,A_r) is jointly similar to (B_1,…,B_r) if there is an invertible C with B_j = C A_j C^{-1} for 1 ≤ j ≤ r. A⊗B denotes the tensor product of A with B.
Let 1 ≤ d < n. Suppose that A_1,…,A_r are n×n-matrices with entries in K. Then we say that a collection (X_1,…,X_r) of d×d-matrices with entries in K is an L_{2,d}-spectral radius dimensionality reduction (abbreviated LSRDR) of A_1,…,A_r if the following quantity is locally maximized:
ρ(A_1⊗(X_1^*)^T + ⋯ + A_r⊗(X_r^*)^T) / ρ(X_1⊗(X_1^*)^T + ⋯ + X_r⊗(X_r^*)^T)^{1/2}.
LSRDRs may be computed using a variation of gradient ascent.
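As a concrete illustration, here is a minimal numpy sketch that evaluates the fitness ratio above and locally maximizes it; random hill climbing is used here only as a simple stand-in for the gradient-ascent variant mentioned above, and all function names are hypothetical. Note that (X^*)^T is the entrywise conjugate of X, so the tensor factors are simply the conjugated X_j.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return np.abs(np.linalg.eigvals(M)).max()

def lsrdr_fitness(As, Xs):
    """rho(sum_j A_j (x) conj(X_j)) / rho(sum_j X_j (x) conj(X_j))^(1/2)."""
    num = sum(np.kron(A, X.conj()) for A, X in zip(As, Xs))
    den = sum(np.kron(X, X.conj()) for X in Xs)
    return spectral_radius(num) / spectral_radius(den) ** 0.5

def lsrdr_hill_climb(As, d, steps=5000, step_size=0.05, seed=0):
    """Crude local search over (X_1,...,X_r); a stand-in for gradient ascent."""
    rng = np.random.default_rng(seed)
    Xs = [rng.standard_normal((d, d)) for _ in As]
    best = lsrdr_fitness(As, Xs)
    for _ in range(steps):
        Ys = [X + step_size * rng.standard_normal((d, d)) for X in Xs]
        f = lsrdr_fitness(As, Ys)
        if f > best:
            Xs, best = Ys, f
    return Xs, best

# Example: reduce three random 6x6 matrices to 2x2.
rng = np.random.default_rng(0)
As = [rng.standard_normal((6, 6)) for _ in range(3)]
Xs, fitness = lsrdr_hill_climb(As, d=2)
```

For real inputs the conjugation is a no-op; in practice one would replace the random search with gradient ascent on the (logarithm of the) fitness.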
If (X_1,…,X_r) is an LSRDR of (A_1,…,A_r), then one will typically be able to find matrices R, S, P and some λ ∈ K where X_j = R A_j S for 1 ≤ j ≤ r and where S R = λP and P^2 = P. We shall call P an L_{2,d}-SRDR projection operator of (A_1,…,A_r). (X_1 ⊕ 0_{n−d},…,X_r ⊕ 0_{n−d}) is jointly similar to (λ P A_1 P,…,λ P A_r P), where 0_{n−d} is the (n−d)×(n−d) zero matrix. The L_{2,d}-SRDR projection operator P is typically unique in the sense that if we run the gradient ascent again to obtain another L_{2,d}-SRDR projection operator, then we will obtain the same L_{2,d}-SRDR projection operator that we originally had. If B_1,…,B_r are n×n-matrices, then let Φ(B_1,…,B_r) : M_n(K) → M_n(K) denote the completely positive linear operator defined by Φ(B_1,…,B_r)(X) = B_1 X B_1^* + ⋯ + B_r X B_r^*. Let H denote the dominant eigenvector of Φ(B_1,…,B_r) with Tr(H) = 1, and let G denote the dominant eigenvector of Φ(B_1,…,B_r)^* with Tr(G) = 1. Then the eigenvectors H, G will typically be positive semidefinite matrices.
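The operator Φ and the eigenvectors H and G can be computed directly. Below is a minimal sketch (with hypothetical helper names) that obtains H and G by power iteration on matrices, starting from the identity, and checks that they come out positive semidefinite; this assumes the dominant eigenvalue of Φ is simple so that the iteration converges.

```python
import numpy as np

def phi(Bs, X):
    """The completely positive operator Phi(B_1,...,B_r)(X) = sum_j B_j X B_j^*."""
    return sum(B @ X @ B.conj().T for B in Bs)

def phi_adjoint(Bs, X):
    """Adjoint of Phi with respect to the trace inner product: X -> sum_j B_j^* X B_j."""
    return sum(B.conj().T @ X @ B for B in Bs)

def dominant_eigenmatrix(op, Bs, n, iters=2000):
    """Power iteration on matrices; returns a trace-one dominant eigenvector of op."""
    H = np.eye(n) / n
    for _ in range(iters):
        H = op(Bs, H)
        H = H / np.trace(H)
    return H

# Example: H and G for a random tuple of matrices; both are typically positive semidefinite.
rng = np.random.default_rng(0)
Bs = [rng.standard_normal((4, 4)) for _ in range(3)]
H = dominant_eigenmatrix(phi, Bs, 4)
G = dominant_eigenmatrix(phi_adjoint, Bs, 4)
print(np.linalg.eigvalsh((H + H.conj().T) / 2))  # eigenvalues should be >= 0 up to numerical error
print(np.linalg.eigvalsh((G + G.conj().T) / 2))
```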
Suppose now that V_1,…,V_q are finite dimensional K-inner product spaces. Let V = V_1 ⊕ ⋯ ⊕ V_q. Let A_1,…,A_r : V → V be linear transformations. Suppose that for 1 ≤ j ≤ r, there are u_j, v_j ∈ {1,…,q} where if x ∈ V_u, y ∈ V_v and ⟨A_j x, y⟩ ≠ 0, then u = u_j and v = v_j. Suppose that P : V → V is an L_{2,d}-SRDR projection operator for (A_1,…,A_r). Then we will typically be able to find linear operators P_j : V_j → V_j for 1 ≤ j ≤ q where P = P_1 ⊕ ⋯ ⊕ P_q. Since P = P^2, we observe that P_j = P_j^2 for all j. As a consequence, there will be positive semidefinite operators G_j, H_j : V_j → V_j for 1 ≤ j ≤ q where G = G_1 ⊕ ⋯ ⊕ G_q and H = H_1 ⊕ ⋯ ⊕ H_q.
Application: Weighted graph/digraph dominant clustering.
Let V = {1,…,n} be a vertex set. Suppose that 1 ≤ d < n. Let f : V×V → K be a function. For example, the function f could denote a weight matrix of a graph or neural network. For each i, j, let A_{i,j} be the n×n-matrix whose (i,j)-th entry is f(i,j) and whose other entries are all zero. Then we will typically be able to find an L_{2,d}-SRDR projection operator P of (A_{i,j})_{i,j} along with a set T ⊆ {1,…,n} where P is the diagonal matrix whose i-th diagonal entry is 1 for i ∈ T and 0 otherwise. The set T represents a dominant cluster of size d in the set V. Let A = (a_{i,j})_{i,j} be the n×n-matrix where a_{i,j} = |f(i,j)|^2 for all i, j. If S ⊆ V, then set
A|_S = (a_{i,j}·χ_S(i)·χ_S(j))_{1≤i≤n, 1≤j≤n}, where χ_S(i) = 1 whenever i ∈ S and χ_S(i) = 0 otherwise. In other words, if A|_S = (b_{i,j})_{i,j}, then b_{i,j} = a_{i,j} whenever i, j ∈ S and b_{i,j} = 0 otherwise. Then the dominant cluster T will typically be the subset of V of size d where the spectral radius ρ(A|_T) is maximized.
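To make the combinatorial side concrete, the following sketch brute-forces the size-d subset T that maximizes ρ(A|_T) for a small weight function f; in the observations above, the L_{2,d}-SRDR projection operator of (A_{i,j})_{i,j} typically picks out exactly this subset. The helper names are hypothetical.

```python
import itertools
import numpy as np

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

def restrict(A, T):
    """A|_T: zero out every row and column whose index lies outside T."""
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[list(T)] = True
    return A * np.outer(mask, mask)

def dominant_cluster_brute_force(A, d):
    """The size-d subset T maximizing rho(A|_T), found by exhaustive search."""
    n = A.shape[0]
    return max(itertools.combinations(range(n), d),
               key=lambda T: spectral_radius(restrict(A, T)))

# Example with a random weight function f on 8 vertices: a_{i,j} = |f(i,j)|^2.
rng = np.random.default_rng(1)
F = rng.standard_normal((8, 8))
A = np.abs(F) ** 2
print(dominant_cluster_brute_force(A, d=3))
```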
We say that a square matrix C with non-negative real entries is a primitive matrix if there is some n where each entry of C^n is positive. Suppose now that C is the direct sum of a primitive matrix and a zero matrix. Then the spectral radius ρ(C) is the dominant eigenvalue of C, and the root ρ(C) of the characteristic polynomial of C has multiplicity 1. Furthermore, there is a vector v with non-negative real entries with C v = ρ(C) v. We shall call v the Perron-Frobenius eigenvector of C.
For T ⊆ {1,…,n}, let v_T be the dominant eigenvector of A|_T where the sum of the entries of v_T is 1, and let w_T be the dominant eigenvector of (A|_T)^* where the sum of the entries of w_T is 1. If u is a vector, then let diag(u) denote the diagonal matrix whose diagonal entries are the entries of u. Then H = diag(v_T) and G = diag(w_T).
The problem of maximizing ρ(A|_T) is a natural problem that is meaningful for adjacency matrices of (weighted) graphs/digraphs and for Markov chains. If f is the adjacency matrix of a graph or digraph G, then the value ρ(A|_T) is a measure of how internally connected the induced subgraph G[T] is, and if the graph is undirected and simple, then ρ(A|_T) is maximized when T is a clique (recall that a subset of the vertices of a simple undirected graph is a clique if all pairs of distinct nodes in it are connected to each other). More specifically, the number of walks with m edges in the induced subgraph G[T] will be about ρ(A|_T)^m. To make this statement precise, if there are t_m walks of length m in the induced subgraph G[T], then lim_{m→∞} t_m^{1/m} = ρ(A|_T). Therefore, the set T maximizes the number of walks of length m in the induced subgraph G[T] for large m.
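Here is a minimal check of the walk-counting statement on a small hypothetical example: the number of length-m walks in G[T] is the sum of the entries of the m-th power of the adjacency matrix of G[T], and its m-th root approaches ρ(A|_T) (which equals the spectral radius of the submatrix indexed by T).

```python
import numpy as np

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

# Adjacency matrix of a small undirected graph and a subset T that forms a triangle.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
T = [0, 1, 2]
A_T = A[np.ix_(T, T)]   # adjacency matrix of the induced subgraph G[T]

m = 30
t_m = np.sum(np.linalg.matrix_power(A_T, m))   # number of walks with m edges in G[T]
print(t_m ** (1 / m), spectral_radius(A_T))    # both approach rho(A|_T) = 2 for a triangle
```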
The problem of determining whether a graph has a clique of size d is an NP-complete problem, so we should not expect there to be an algorithm that always solves this problem efficiently. On the other hand, for many NP-complete problems, there are plenty of heuristic algorithms that give decent solutions in most cases. The use of LSRDRs to find the clique T is another kind of heuristic algorithm that can be used to find the largest clique in a graph and to solve more general problems. But the NP-completeness of the clique problem also indicates that LSRDRs most likely are unable to produce cliques in exceptionally difficult graphs.
If A is the transition matrix of an irreducible and aperiodic Markov chain (X_m)_{m≥0}, then the probability that X_m,…,X_{m+k} ∈ T will be approximately ρ(A|_T)^k. More precisely, lim_{k→∞} P(X_0,…,X_k ∈ T)^{1/k} = ρ(A|_T). In this case, the set T maximizes the probability P(X_0,…,X_k ∈ T) for large values of k.
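A minimal sketch of the Markov chain statement, with a hypothetical 3-state transition matrix: the probability of staying inside T for k consecutive steps is π_0|_T (A restricted to T)^k 1, and its k-th root approaches ρ(A|_T).

```python
import numpy as np

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

# A small irreducible, aperiodic transition matrix (rows sum to 1) and a set T.
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
T = [0, 1]
A_T = A[np.ix_(T, T)]               # sub-stochastic matrix of transitions staying inside T
pi0 = np.full(len(A), 1 / len(A))   # uniform initial distribution

k = 200
# P(X_0, ..., X_k all lie in T) = pi0 restricted to T, times (A_T)^k, times the all-ones vector.
p = pi0[T] @ np.linalg.matrix_power(A_T, k) @ np.ones(len(T))
print(p ** (1 / k), spectral_radius(A_T))   # both should be close to rho(A|_T)
```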
Maximizing the total weight of edges of an induced subgraph:
If Y is a matrix, then let ∑Y denote the sum of all the entries in Y.
Proposition: Suppose that M_d is the d×d-matrix where each entry of M_d is 1/d. Let X be a real d×d-matrix. Then lim_{δ→0} (ρ(M_d + δX) − ρ(M_d))/δ = (∑X)/d.
Let M_{n,d} be the n×n-matrix where each entry of M_{n,d} is 1/d. Let B be a real n×n-matrix. For simplicity, assume that the value ∑B|_T is distinct for each T ⊆ {1,…,n} with |T| = d. Let A = M_{n,d} + δB. Then for sufficiently small δ, the spectral radius ρ(A|_T) is maximized (subject to the condition that |T| = d) precisely when the sum ∑B|_T is maximized. LSRDRs may therefore be used to find the subset T ⊆ {1,…,n} with |T| = d that maximizes ∑B|_T.
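The following sketch checks the proposition by a finite difference and then verifies, by brute force on a small random B, that for small δ the subset T maximizing ρ((M_{n,d} + δB)|_T) coincides with the subset maximizing ∑B|_T; the sizes and constants are arbitrary choices made for illustration.

```python
import itertools
import numpy as np

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

def restrict(A, T):
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[list(T)] = True
    return A * np.outer(mask, mask)

rng = np.random.default_rng(2)
n, d, delta = 7, 3, 1e-4

# Finite-difference check of the proposition: (rho(M_d + delta X) - rho(M_d)) / delta ~ sum(X)/d.
M_d = np.full((d, d), 1 / d)
X = rng.standard_normal((d, d))
print((spectral_radius(M_d + delta * X) - spectral_radius(M_d)) / delta, X.sum() / d)

# For small delta, maximizing rho((M_{n,d} + delta B)|_T) over |T| = d
# agrees with maximizing the total weight sum(B|_T).
M_nd = np.full((n, n), 1 / d)
B = rng.standard_normal((n, n))
subsets = list(itertools.combinations(range(n), d))
T_rho = max(subsets, key=lambda T: spectral_radius(restrict(M_nd + delta * B, T)))
T_sum = max(subsets, key=lambda T: restrict(B, T).sum())
print(T_rho == T_sum)
```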
Why do LSRDRs behave this way?
Suppose that (A_1,…,A_r) are complex matrices that generate the algebra M_n(C). Then there is some invertible B and constant λ where the operator Φ(λ B A_1 B^{-1},…,λ B A_r B^{-1}) is a quantum channel (by a quantum channel, I mean a completely positive trace preserving superoperator), so LSRDRs should be considered to be dimensionality reductions of quantum channels E : M_n(C) → M_n(C). Primitive matrices can be associated with stochastic matrices in the same way: if A is a primitive matrix, then there is a diagonal matrix D and a constant λ where λ D^{-1} A D is a stochastic matrix. One should consider the LSRDR of (A_{i,j})_{i,j} to be a dimensionality reduction for Markov chains. The most sensible way to take a dimensionality reduction of an n-state Markov chain is to select d states so that those d states form a subset that is in some sense optimal. And, for LSRDRs, the best choice of a d-element subset T of {1,…,n} is the one that maximizes ρ(A|_T).
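One way to obtain such a B and λ (a sketch, assuming the dominant eigenvector Q of Φ(A_1,…,A_r)^* is positive definite, which is the typical case when the A_j generate M_n(C)) is to take any B with B^*B = Q and λ = ρ(Φ)^{-1/2}; the Kraus operators λ B A_j B^{-1} then satisfy the trace-preservation condition. The helper names below are hypothetical, and the final line checks trace preservation numerically.

```python
import numpy as np

def phi_adjoint(As, X):
    """Adjoint of Phi(A_1,...,A_r): X -> sum_j A_j^* X A_j."""
    return sum(A.conj().T @ X @ A for A in As)

def channel_normalization(As, iters=5000):
    """Return B, lam such that the Kraus operators lam * B A_j B^{-1} satisfy
    sum_j C_j^* C_j = I, assuming the dominant eigenvector Q of Phi^* is positive definite."""
    n = As[0].shape[0]
    Q = np.eye(n)
    for _ in range(iters):               # power iteration for the dominant eigenvector of Phi^*
        Q = phi_adjoint(As, Q)
        Q = Q / np.trace(Q)
    rho = np.trace(phi_adjoint(As, Q)).real   # dominant eigenvalue (Q has trace one here)
    B = np.linalg.cholesky(Q).conj().T        # any B with B^* B = Q works
    lam = rho ** -0.5
    return B, lam

# Check trace preservation on a random tuple (hypothetical example).
rng = np.random.default_rng(3)
As = [rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4)) for _ in range(3)]
B, lam = channel_normalization(As)
Binv = np.linalg.inv(B)
Cs = [lam * B @ A @ Binv for A in As]
print(np.allclose(sum(C.conj().T @ C for C in Cs), np.eye(4)))   # typically True
```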
Conclusions:
The LSRDRs of (A_{i,j})_{i,j} have a notable combination of interpretability features: these LSRDRs tend to converge to the same local maximum (up to joint similarity and a constant factor) regardless of the initialization, and we are able to give an explicit expression for this local maximum. We also have a duality between the problem of computing the LSRDR of (A_{i,j})_{i,j} and the problem of maximizing ρ(A|_T) where |T| = d. With this duality, the LSRDR of (A_{i,j})_{i,j} is fully interpretable as a solution to a combinatorial optimization problem.
I hope to make more posts about some of my highly interpretable machine learning algorithms, together with some of the tools that we can use to interpret AI.
Edited: 1/10/2024