Every entry in a matrix counts for the L2-spectral radius similarity. Suppose that A1,…,Ar,B1,…,Br are real n×n-matrices. Set A⊗2=A⊗A. Define the L2-spectral radius similarity between (A1,…,Ar) and (B1,…,Br) to be the number
ρ(A1⊗B1+⋯+Ar⊗Br)ρ(A⊗21+⋯+A⊗2r)1/2ρ(B⊗21+⋯+B⊗2r)1/2. Then the L2-spectral radius similarity is always a real number in the interval [0,1], so one can think of the L2-spectral radius similarity as a generalization of the value |⟨u,v⟩|∥u∥⋅∥v∥ where u,v are real or complex vectors. It turns out experimentally that if A1,…,Ar are random real matrices, and each Bj is obtained from Aj by replacing each entry in Bj with 0 with probability 1−α, then the L2-spectral radius similarity between (A1,…,Ar) and (B1,…,Br) will be about √α. If u=(A1,…,Ar),v=(B1,…,Br), then observe that |⟨u,v⟩|∥u∥⋅∥v∥≈√α as well.
Suppose now that A1,…,Ar are random real n×n matrices and C1,…,Cr are the m×m submatrices of A1,…,Ar respectively obtained by only looking at the first m rows and columns of A1,…,Ar. Then the L2-spectral radius similarity between A1,…,Ar and C1,…,Cr will be about √m/n. We can therefore conclude that in some sense C1,…,Cr is a simplified version of A1,…,Ar that more efficiently captures the behavior of A1,…,Ar than B1,…,Br does.
If A1,…,Ar,B1,…,Br are independent random matrices with standard Gaussian entries, then the L2-spectral radius similarity between (A1,…,Ar) and (B1,…,Br) will be about 1/√r with small variance. If u,v are random Gaussian vectors of length r, then |⟨u,v⟩|∥u∥⋅∥v∥ will on average be about c/√r for some constant c, but |⟨u,v⟩|∥u∥⋅∥v∥ will have a high variance.
These are some simple observations that I have made about the spectral radius during my research for evaluating cryptographic functions for cryptocurrency technologies.
Your notation is confusing me. If r is the size of the list of matrices, then how can you have a probability of 1-r for r>=2? Maybe you mean 1-1/r and sqrt{1/r} instead of 1-r and sqrt{r} respectively?
Thanks for pointing that out. I have corrected the typo. I simply used the symbol r for two different quantities, but now the probability is denoted by the symbol α.
I have originally developed a machine learning notion which I call an LSRDR (L2,d
-spectral radius dimensionality reduction), and LSRDRs (and similar machine learning models) behave mathematically and they have a high level of interpretability which should be good for AI safety. Here, I am giving an example of how LSRDRs behave mathematically and how one can get the most out of interpreting an LSRDR.
Suppose that n is a natural number. Let N denote the quantum channel that takes an n qubit quantum state and selects one of those qubits at random and send that qubit through the completely depolarizing channel (the completely depolarizing channel takes a state as input and returns the completely mixed state as an output).
If A1,…,Ar,B1,…,Br are complex matrices, then define superoperators Φ(A1,…,Ar) and Γ(A1,…,Ar;B1,…,Br) by setting
Φ(A1,…,Ar)(X)=∑rk=1AkXA∗k and Γ(A1,…,Ar;B1,…,Br)=∑rk=1AkXB∗k for all X.
Given tuples of matrices (A1,…,Ar),(B1,…,Br), define the L_2-spectral radius similarity between these tuples of matrices by setting
Suppose now that A1,…,A4n are matrices where N=Φ(A1,…,A4n). Let 1≤d≤n. We say that a tuple of complex d by d matrices (X1,…,X4n) is an LSRDR of A1,…,A4n if the quantity ∥(A1,…,A4n)≃(X1,…,X4n)∥2 is locally maximized.
Suppose now that X1,…,X4n are complex 2×2-matrices and (X1,…,X4n) is an LSRDR of A1,…,A4n. Then my computer experiments indicate that there will be some constant λ where λΓ(A1,…,A4n;X1,…,X4n) is similar to a positive semidefinite operator with eigenvalues {0,…,n+1} and where the eigenvalue j has multiplicity 3⋅C(n−1,k)+C(n−1,k−2) where C(⋅,⋅) denotes the binomial coefficient. I have not had a chance to try to mathematically prove this. Hooray. We have interpreted the LSRDR (X1,…,X4n) of (A1,…,A4n), and I have plenty of other examples of interpreted LSRDRs.
We also have a similar pattern for the spectrum of Φ(A1,…,A4n). My computer experiments indicate that there is some constant λ where λ⋅Φ(A1,…,A4n) has spectrum {0,…,n} where the eigenvalue j has multiplicity 3n−j⋅C(n,j).
The notion of the linear regression is an interesting machine learning algorithm in the sense that it can be studied mathematically, but the notion of a linear regression is a quite limited machine learning algorithm as most relations are non-linear. In particular, the linear regression does not give us any notion of any uncertainty in the output.
One way to extend the notion of the linear regression to encapsulate uncertainty in the outputs is to regress a function not to a linear transformation mapping vectors to vectors, but to regress the function to a transformation that maps vectors to mixed states. And the notion of a quantum channel is an appropriate transformation that maps vectors to mixed states. One can optimize this quantum channel using gradient ascent.
For this post, I will only go through some basic facts about quantum information theory. The reader is referred to the book The Theory of Quantum Information by John Watrous for all the missing details.
Let V be a complex Euclidean space. Let L(V) denote the vector space of linear operators from V to V. Given complex Euclidean spaces V,W, we say that a linear operator E from L(V) to L(W) is a trace preserving if Tr(E(X))=Tr(X)
for all X, and we say that E is completely positive if there are linear transformations A1,...,Ar where E(X)=A1XA∗1+⋯+ArXA∗r for all X; the value r is known as the Choi rank of E. A completely positive trace preserving operator is known as a quantum channel.
The collection of quantum channels from L(V) to L(W) is compact and convex.
If W is a complex Euclidean space, then let Dp(W)
denote the collection of pure states in W. Dp(W)
can be defined either as the set of unit vector in W modulo linear dependence, or Dp(W)
can be also defined as the collection of positive semidefinite rank-1 operators on W with trace 1.
Given complex Euclidean spaces U,V and a (multi) set of r distinct ordered pairs of unit vectors f={(u1,v1),…,(un,vn)}⊆U×V, and given a quantum channel
E:L(U)→L(V), we define the fitness level F(f,E)=∑rk=1log(E(uku∗k)vk,vk⟩ and the loss level L(f,E)=∑rk=1−log(E(uku∗k)vk,vk⟩.
We may locally optimize E to minimize its loss level using gradient descent, but there is a slight problem. The set of quantum channels spans the L(L(U),L(V)) which has dimension Dim(U)2⋅Dim(V)2. Due to the large dimension, any locally optimal E will contain Dim(U)2⋅Dim(V)2 many parameters, and this is a large quantity of parameters for what is supposed to be just a glorified version of a linear regression. Fortunately, instead of taking all quantum channels into consideration, we can limit the scope the quantum channels of limited Choi rank.
Empirical Observation: Suppose that U,V are complex Euclidean spaces, f⊆U×V is finite and r is a positive integer. Then computer experiments indicate that there is typically only one quantum channel E:L(U)→L(V) of Choi rank at most r where L(f,E) is locally minimized. More formally, if we run the experiment twice and produce two quantum channels E1,E2 where L(f,Ej) is locally minimized for j∈{1,2}, then we would typically have E1=E2. We therefore say that when L(f,E) is minimized, E is the best Choi rank r quantum channel approximation to f.
Suppose now that f={(u1,v1),…,(un,vn)}⊆Dp(U)×Dp(V) is a multiset. Then we would ideally like to approximate the function f better by alternating between the best Choi rank r quantum channel approximation and a non-linear mapping. An ideal choice of a non-linear but partial mapping is the function DE that maps a positive semidefinite matrix P to its (equivalence class of) unit dominant eigenvector.
Empirical observation: If f={(u1,v1),…,(un,vn)}⊆Dp(U)×Dp(V) and E is the best Choi rank r quantum channel approximation to f, then let u♯j=DE(E(uju∗j)) for all j, and define f♯={(u♯1,v1),…,(u♯n,vn)}. Let U be a small open neighborhood of f♯. Let g∈U. Then we typically have g♯♯=g♯. More generally, the best Choi rank r quantum channel approximation to g is typically the identity function.
From the above observation, we see that the vector u♯j is an approximation of vj that cannot be improved upon. The mapping DE∘E:Dp(U)→Dp(V) is therefore a trainable approximation to the mapping f and since Dp(U),Dp(V) are not even linear spaces (these are complex projective spaces with non-trivial homology groups), the mapping DE∘E is a non-linear model for the function to f.
I have been investigating machine learning models similar to DE∘E for cryptocurrency research and development as these sorts of machine learning models seem to be useful for evaluating the cryptographic security of some proof-of-work problems and other cryptographic functions like block ciphers and hash functions. I have seen other machine learning models that behave about as mathematically as DE∘E.
I admit that machine learning models like DE∘E are currently far from being as powerful as deep neural networks, but since DE∘E behaves mathematically, the model DE∘E should be considered as a safer and interpretable AI model. The goal is to therefore develop models that are mathematical like DE∘E but which can perform more and more machine learning tasks.
Here is an example of what might happen. Suppose that for each uj, we select a orthonormal basis ej,1,…,ej,s of unit vectors for V. Let R={(uj,ej,k):1≤j≤n,1≤k≤s}. Then
Then for each quantum channel E, by the concavity of the logarithm function (which is the arithmetic-geometric mean inequality), we have
L(R,E)=∑nj=1∑nk=1−log(E(uju∗j)ej,k,ej,k⟩)
≤∑nj=1−log(∑nk=1⟨E(uju∗j)ej,k,ej,k⟩)
=∑nj=1−log(Tr(E)). Here, equality is reached if and only if
E(uju∗j)ej,k,ej,k⟩=E(uju∗j)ej,l,ej,l⟩ for each j,k,l, but this equality can be achieved by the channel
defined by E(X)=Tr(X)⋅I/s which is known as the completely depolarizing channel. This is the channel that always takes a quantum state and returns the completely mixed state. On the other hand, the channel E has maximum Choi rank since the Choi representation of E is just the identity function divided by the rank. This example is not unexpected since for each input of R the possible outputs span the entire space V evenly, so one does not have any information about the output from any particular input except that we know that the output could be anything. This example shows that the channels that locally minimize the loss function L(R,E) are the channels that give us a sort of linear regression of R but where this linear regression takes into consideration uncertainty in the output so the regression of a output of a state is a mixed state rather than a pure state.
We can use the L2−spectral radius similarity to measure more complicated similarities between data sets.
Suppose that A1,…,Ar are m×m-real matrices and B1,…,Br are n×n-real matrices. Let ρ(A) denote the spectral radius of A and let A⊗B denote the tensor product of A with B. Define the L2-spectral radius by setting ρ2(A1,…,Ar)=ρ(A1⊗A1+⋯+Ar⊗Ar)1/2, Define the L2-spectral radius similarity between A1,…,Ar and B1,…,Br as
We observe that if C is invertible and λ is a constant, then
∥(A1,…,Ar)≃(λCB1C−1,…,λCBrC−1)∥2=1.
Therefore, the L2-spectral radius is able to detect and measure symmetry that is normally hidden.
Example: Suppose that u1,…,ur;v1,…,vr are vectors of possibly different dimensions. Suppose that we would like to determine how close we are to obtaining an affine transformation T with T(uj)=vj for all j (or a slightly different notion of similarity). We first of all should normalize these vectors to obtain vectors x1,…,xr;y1,…,yr with mean zero and where the covariance matrix is the identity matrix (we may not need to do this depending on our notion of similarity). Then ∥(x1x∗1,…,xrx∗r)≃(y1y∗1,…,yry∗r)∥2 is a measure of low close we are to obtaining such an affine transformation T. We may be able to apply this notion to determining the distance between machine learning models. For example, suppose that M,N are both the first few layers in a (typically different) neural network. Suppose that a1,…,ar is a set of data points. Then if uj=M(aj) and vj=M(aj), then ∥(x1x∗1,…,xrx∗r)≃(y1y∗1,…,yry∗r)∥2 is a measure of the similarity between M and N.
I have actually used this example to see if there is any similarity between two different neural networks trained on the same data set. For my experiment, I chose a random collection of S⊆{0,1}32×{0,1}32 of ordered pairs and I trained the neural networks M,N to minimize the expected losses E(∥N(a)−b∥2:(a,b)∈S),E(∥M(a)−b∥2:(a,b)∈S). In my experiment, each aj was a random vector of length 32 whose entries were 0′s and 1′s. In my experiment, the similarity ∥(x1x∗1,…,xrx∗r)≃(y1y∗1,…,yry∗r)∥2 was worse than if x1,…,xr,y1,…,yr were just random vectors.
This simple experiment suggests that trained neural networks retain too much random or pseudorandom data and are way too messy in order for anyone to develop a good understanding or interpretation of these networks. In my personal opinion, neural networks should be avoided in favor of other AI systems, but we need to develop these alternative AI systems so that they eventually outperform neural networks. I have personally used the L2-spectral radius similarity to develop such non-messy AI systems including LSRDRs, but these non-neural non-messy AI systems currently do not perform as well as neural networks for most tasks. For example, I currently cannot train LSRDR-like structures to do any more NLP than just a word embedding, but I can train LSRDRs to do tasks that I have not seen neural networks perform (such as a tensor dimensionality reduction).
So in my research into machine learning algorithms that I can use to evaluate small block ciphers for cryptocurrency technologies, I have just stumbled upon a dimensionality reduction for tensors in tensor products of inner product spaces that according to my computer experiments exists, is unique, and which reduces a real tensor to another real tensor even when the underlying field is the field of complex numbers. I would not be too surprised if someone else came up with this tensor dimensionality reduction before since it has a rather simple description and it is in a sense a canonical tensor dimensionality reduction when we consider tensors as homogeneous non-commutative polynomials. But even if this tensor dimensionality reduction is not new, this dimensionality reduction algorithm belongs to a broader class of new algorithms that I have been studying recently such as LSRDRs.
Suppose that K is either the field of real numbers or the field of complex numbers. Let V1,…,Vn be finite dimensional inner product spaces over K with dimensions d1,…,dn respectively. Suppose that Vi has basis ei,1,…,ei,di. Given v∈V1⊗⋯⊗Vn, we would sometimes want to approximate the tensor v with a tensor that has less parameters. Suppose that (m0,…,mn) is a sequence of natural numbers with m0=mn=1. Suppose that Xi,j is a mi−1×mi matrix over the field K for 1≤i≤n and 1≤j≤di. From the system of matrices (Xi,j)i,j, we obtain a tensor T((Xi,j)i,j)=∑i1,…,inei1⊗⋯⊗ein⋅X1,i1…Xn,in. If the system of matrices (Xi,j)i,j locally minimizes the distance ∥v−T((Xi,j)i,j)∥, then the tensor T((Xi,j)i,j) is a dimensionality reduction of v which we shall denote by u.
Intuition: One can associate the tensor product V1⊗⋯⊗Vn with the set of all degree n homogeneous non-commutative polynomials that consist of linear combinations of the monomials of the form x1,i1⋯xn,in. Given, our matrices Xi,j, we can define a linear functional ϕ:V1⊗⋯⊗Vn→K by setting ϕ(p)=p((Xi,j)i,j). But by the Reisz representation theorem, the linear functional ϕ is dual to some tensor in V1⊗⋯⊗Vn. More specifically, ϕ is dual to T((Xi,j)i,j). The tensors of the form T((Xi,j)i,j) are therefore the
Advantages:
In my computer experiments, the reduced dimension tensor u is often (but not always) unique in the sense that if we calculate the tensor u twice, then we will get the same tensor. At least, the distribution of reduced dimension tensors u will have low Renyi entropy. I personally consider the partial uniqueness of the reduced dimension tensor to be advantageous over total uniqueness since this partial uniqueness signals whether one should use this tensor dimensionality reduction in the first place. If the reduced tensor is far from being unique, then one should not use this tensor dimensionality reduction algorithm. If the reduced tensor is unique or at least has low Renyi entropy, then this dimensionality reduction works well for the tensor v.
This dimensionality reduction does not depend on the choice of orthonormal basis ei,1,…,ei,di. If we chose a different basis for each Vi, then the resulting tensor u of reduced dimensionality will remain the same (the proof is given below).
If K is the field of complex numbers, but all the entries in the tensor v happen to be real numbers, then all the entries in the tensor u will also be real numbers.
This dimensionality reduction algorithm is intuitive when tensors are considered as homogeneous non-commutative polynomials.
Disadvantages:
This dimensionality reduction depends on a canonical cyclic ordering the inner product spaces V1,…,Vn.
Other notions of dimensionality reduction for tensors such as the CP tensor dimensionality reduction and the Tucker decompositions are more well-established, and they are obviously attempted generalizations of the singular value decomposition to higher dimensions, so they may be more intuitive to some.
The tensors of reduced dimensionality T((Xi,j)i,j) have a more complicated description than the tensors in the CP tensor dimensionality reduction.
Proposition: The set of tensors of the form ∑i1,…,ine1,i1⊗⋯⊗en,inX1,i1…Xn,in does not depend on the choice of bases (ei,1,…,ei,di)i.
Proof: For each i, let fi,1,…,fi,di be an alternative basis for Vi. Then suppose that ei,j=∑kui,j,kfi,k for each i,j. Then
A failed generalization: An astute reader may have observed that if we drop the requirement that mn=1, then we get a linear functional defined by letting
ϕ(p)=Tr(p((Xi,j)i,j)). This is indeed a linear functional, and we can try to approximate v using a the dual to ϕ, but this approach does not work as well.
In this note, I will continue to demonstrate not only the ways in which LSRDRs (L2,d-spectral radius dimensionality reduction) are mathematical but also how one can get the most out of LSRDRs. LSRDRs are one of the types of machine learning that I have been working on, and LSRDRs have characteristics that tell us that LSRDRs are often inherently interpretable which should be good for AI safety.
Suppose that N is the quantum channel that maps a n qubit state to a n qubit state where we select one of the 6 qubits at random and send it through the completely depolarizing channel (the completely depolarizing channel takes a state as an input and returns the completely mixed state as an output). Suppose that A1,…,A4n are 2n by 2n matrices where N has the Kraus representation N(X)=∑4nk=1AkXA∗k.
The objective is to locally maximize the fitness level ρ(∑4nk=1zkAk)/∥(z1,…,z4n)∥ where the norm in question is the Euclidean norm and where ρ denotes the spectral radius. This is a 1 dimensional case of an LSRDR of the channel N.
Let A=∑4nk=1zkAk when (z1,…,z4n) is selected to locally maximize the fitness level. Then my empirical calculations show that there is some λ where λ∑4nk=1zkAkis positive semidefinite with eigenvalues {0,…,n} and where the eigenvalue k has multiplicity (nk) which is the binomial coefficient. But these are empirical calculations for select values λ; I have not been able to mathematically prove that this is always the case for all local maxima for the fitness level (I have not tried to come up with a proof).
Here, we have obtained a complete characterization of A up-to-unitary equivalence due to the spectral theorem, so we are quite close to completely interpreting the local maximum for our fitness function.
I made a few YouTube videos showcasing the process of maximizing the fitness level here.
I personally like my machine learning algorithms to behave mathematically especially when I give them mathematical data. For example, a fitness function with apparently one local maximum value is a mathematical fitness function. It is even more mathematical if one can prove mathematical theorems about such a fitness function or if one can completely describe the local maxima of such a fitness function. It seems like fitness functions that satisfy these mathematical properties are more interpretable than the fitness functions which do not, so people should investigate such functions for AI safety purposes.
My notion of an LSRDR is a notion that satisfies these mathematical properties. To demonstrate the mathematical behavior of LSRDRs, let’s see what happens when we take an LSRDR of the octonions.
Let K denote either the field of real numbers or the field of complex numbers (K
could also be the division ring of quaternions, but for simplicity, let’s not go there). If A1,…,Ar are n×n-matrices over K, then an LSRDR (L2,d-spectral radius dimensionality reduction) of A1,…,Ar is a collection X1,…,Xr of d×d-matrices that locally maximizes the fitness level
ρ(A1⊗¯¯¯¯¯¯X1+⋯+Ar⊗¯¯¯¯¯¯Xr)ρ(X1⊗¯¯¯¯¯¯X1+⋯+Xr⊗¯¯¯¯¯¯Xr)1/2. ρ denotes the spectral radius function while ⊗ denotes the tensor product and ¯¯¯¯Z denotes the matrix obtained from Z by replacing each entry with its complex conjugate. We shall call the maximum fitness level the L2,d-spectral radius of A1,…,Ar over the field K, and we shall wrote ρK2,d(A1,…,Ar) for this spectral radius.
Define the linear superoperator Γ(A1,…,Ar;X1,…,Xr) by setting
Γ(A1,…,Ar;X1,…,Xr)(X)=A1XX∗1+⋯+ArXX∗r and set Φ(X1,…,Xr)=Γ(X1,…,Xr;X1,…,Xr). Then the fitness level of X1,…,Xr is ρ(Γ(A1,…,Ar;X1,…,Xr))Φ(X1,…,Xr)1/2.
Suppose that V is an 8-dimensional real inner product space. Then the octonionic multiplication operation is the unique up-to-isomorphism bilinear binary operation ∗ on V together with a unit 1 such that∥x∗y∥=∥x∥⋅∥y∥ and 1∗x=x∗1=1 for all x,y∈V. If we drop the condition that the octonions have a unit, then we do not quite have this uniqueness result.
We say that an octonion-like algbera is a 8-dimensional real inner product space V together with a unique up-to-isomorphism bilinear operation ∗ such that ∥x∗y∥=∥x∥⋅∥y∥ for all x,y.
Let V be a specific octonion-like algebra.
Suppose now that e1,…,e8 is an orthonormal basis for V (this does not need to be the standard basis). Then for each j∈{1,…,8}, let Aj be the linear operator from V to V defined by setting Ajv=ej∗v for all vectors v. All non-zero linear combinations of A1,…,A8 are conformal mappings (this means that they preserve angles). Now that we have turned the octonion-like algebra into matrices, we can take an LSRDR of the octonion-like algebras, but when taking the LSRDR of octonion-like algebras, we should not worry about the choice of orthonormal basis e1,…,e8 since I could formulate everything in a coordinate-free manner.
Empirical Observation from computer calculations: Suppose that 1≤d≤8 and K is the field of real numbers. Then the following are equivalent.
The d×d matrices X1,…,X8 are a LSRDR of A1,…,A8 over K where A1⊗X1+⋯+A8⊗X8 has a unique real dominant eigenvalue.
There exists matrices R,S where Xj=RAjS for all j and where SR is an orthonormal projection matrix.
In this case, ρK2,d(A1,…,A8)=√d and this fitness level is reached by the matrices X1,…,X8 in the above equivalent statements. Observe that the superoperator Γ(A1,…,A8;PA1P,…,PA8P) is similar to a direct sum of Γ(A1,…,Ar;X1,…,Xr)) and a zero matrix. But the projection matrix P is a dominant eigenvector of Γ(A1,…,A8;PA1P,…,PA8P) and ofΦ(PA1P,…,PA8P) as well.
I have no mathematical proof of the above fact though.
Now suppose that K=C. Then my computer calculations yield the following complex L2,d-spectral radii: (ρK2,j(A1,…,A8))8j=1
Each time that I have trained a complex LSRDR of A1,…,A8, I was able to find a fitness level that is not just a local optimum but also a global optimum.
In the case of the real LSRDRs, I have a complete description of the LSRDRs of (A1,…,A8). This demonstrates that the octonion-like algebras are elegant mathematical structures and that LSRDRs behave mathematically in a manner that is compatible with the structure of the octonion-like algebras.
I have made a few YouTube videos that animate the process of gradient ascent to maximize the fitness level.
Edit: I have made some corrections to this post on 9/22/2024.
There are some cases where we have a complete description for the local optima for an optimization problem. This is a case of such an optimization problem.
Such optimization problems are useful for AI safety since a loss/fitness function where we have a complete description of all local or global optima is a highly interpretable loss/fitness function, and so one should consider using these loss/fitness functions to construct AI algorithms.
Theorem: Suppose that U is a real,complex, or quaternionic n×n-matrix that minimizes the quantity ∥U∥2+∥U−1∥2. Then U is unitary.
Proof: The real case is a special case of a complex case, and by representing each n×n-quaternionic matrix as a complex 2n×2n-matrix, we may assume that U is a complex matrix.
By the Schur decomposition, we know that U=VTV∗ where V is a unitary matrix and T is upper triangular. But we know that ∥U∥2=∥T∥2. Furthermore, U−1=VT−1V∗, so ∥U−1∥2=∥T−1∥2. Let D denote the diagonal matrix whose diagonal entries are the same as T. Then ∥T∥2≥∥D∥2 and ∥T−1∥2≥∥D−1∥2. Furthermore, ∥T∥2=∥D∥2 iff T is diagonal and ∥T−1∥2=∥D−1∥2 iff D is diagonal. Therefore, since ∥U∥2+∥U−1∥2=∥T∥2+∥T−1∥2 and ∥T∥2+∥T−1∥2 is minimized, we can conclude that T=D, so T is a diagonal matrix. Suppose that T has diagonal entries (z1,…,zn). By the arithmetic-geometric mean equality and the Cauchy-Schwarz inequality, we know that 12⋅(∥(z1,…,zn)∥2+∥(z−11,…,z−1n)∥2)≥∥(|z1|,…,|zn|)∥2⋅∥(|z−11|,…,|z−1n)|∥2
≥⟨(|z1|,…,|zn|),(|z−11|,…,|z−1n)|⟩=√n.
Here, the equalities hold if and only if |zj|=1 for all j, but this implies that U is unitary. Q.E.D.
The L2-spectral radius similarity is not transitive. Suppose that A1,…,Ar are m×m-matrices and B1,…,Br are real n×n-matrices. Then define ρ2(A1,…,Ar)=ρ(A1⊗A1+⋯+Ar⊗Ar)1/2. Then the generalized Cauchy-Schwarz inequality is satisfied:
ρ(A1⊗B1+⋯+Ar⊗Br)≤ρ2(A1,…,Ar)ρ2(B1,…,Br).
We therefore define the L2,d-spectral radius similarity between (A1,…,Ar) and (B1,…,Br) as ∥(A1,…,Ar)≃(B1,…,Br)∥=ρ(A1⊗B1+⋯+Ar⊗Br)ρ2(A1,…,Ar)ρ2(B1,…,Br). One should think of the L2-spectral radius similarity as a generalization of the cosine similarity ⟨u,v⟩∥u∥⋅∥v∥ between vectors u,v. I have been using the L2-spectral radius similarity to develop AI systems that seem to be very interpretable. The L2-spectral radius similarity is not transitive.
∥(A1,…,Ar)≃(A1⊕B1,…,Ar⊕Br)∥=1 and
∥(B1,…,Br)≃(A1⊕B1,…,Ar⊕Br)∥=1, but ∥(A1,…,Ar)≃(B1,…,Br)∥ can take any value in the interval [0,1].
We should therefore think of the L2,d-spectral radius similarity as a sort of least upper bound of [0,1]-valued equivalence relations than a [0,1]-valued equivalence relation. We need to consider this as a least upper bound because matrices have multiple dimensions.
Notation: ρ(A)=limn→∞∥An∥1/n is the spectral radius. The spectral radius A is the largest magnitude of an eigenvalue of the matrix A. Here the norm does not matter because we are taking the limit.A⊕B is the direct sum of matrices while A⊗B denotes the Kronecker product of matrices.
Set up: Let K denote either the field of real or the field of complex numbers. Suppose that d1,…,dr are positive integers. Let m0,…,mn be a sequence of positive integers with m0=mn=1. Suppose that Xi,j is an mi−1×mi-matrix whenever 1≤j≤di. Then from the matrices Xi,j, we can define a d1×⋯×dr-tensor T((Xi,j)i,j)=(X1,i1…Xn,in)i1,…,in. I have been doing computer experiments where I use this tensor to approximate other tensors by minimizing the ℓ2-distance. I have not seen this tensor approximation algorithm elsewhere, but perhaps someone else has produced this tensor approximation construction before. In previous shortform posts on this site, I have given evidence that the tensor dimensionality reduction behaves well, and in this post, we will focus on ways to compute with the tensors T((Xi,j)i,j), namely the inner product of such tensors and the gradient of the inner product with respect to the matrices (Xi,j)i,j.
Notation: If A1,…,Ar,B1,…,Br are matrices, then let Γ(A1,…,Ar;B1,…,Br) denote the superoperator defined by letting Γ(A1,…,Ar;B1,…,Br)(X)=A1XB∗1+⋯+ArXB∗r. Let Φ(A1,…,Ar)=Γ(A1,…,Ar;A1,…,Ar).
Inner product: Here is the computation of the inner product of our tensors.
In particular, ∥T((Ai,j)i,j)∥2=Φ(A1,1,…,A1,d1)…Φ(An,1,…,An,dn).
Gradient: Observe that ∇XTr(AX)=AT. We will see shortly that the cyclicity of the trace is useful for calculating the gradient. And here is my manual calculation of the gradient of the inner product of our tensors.
So in my research into machine learning algorithms, I have stumbled upon a dimensionality reduction algorithm for tensors, and my computer experiments have so far yielded interesting results. I am not sure that this dimensionality reduction is new, but I plan on generalizing this dimensionality reduction to more complicated constructions that I am pretty sure are new and am confident would work well.
Suppose that K is either the field of real numbers or the field of complex numbers. Suppose that d1,…,dn are positive integers and (m0,…,mn) is a sequence of positive integers with m0=mn=1. Suppose that Xi,j is an mi−1×mi-matrix whenever 1≤j≤di. Then define a tensor T((Xi,j))=(X1,i1…Xn,in)i1,…,in∈Kd1⊗⋯⊗Kdn.
If v∈Kd1⊗⋯⊗Kdn, and (Xi,j)i,j is a system of matrices that minimizes the value ∥v−T((Xi,j))∥, then T((Xi,j)i,j) is a dimensionality reduction of (Xi,j)i,j, and we shall denote let u denote the tensor of reduced dimension T((Xi,j)i,j). We shall call u a matrix table to tensor dimensionality reduction of type (m0,…,mn).
Observation 1: (Sparsity) If v is sparse in the sense that most entries in the tensor v are zero, then the tensor u will tend to have plenty of zero entries, but as expected, u will be less sparse than v.
Observation 2: (Repeated entries) If v is sparse and v=(xi1,…,in)i1,…,in and the set {xi1,…,in:i1,…,in} has small cardinality, then the tensor u will contain plenty of repeated non-zero entries.
Observation 3: (Tensor decomposition) Let v be a tensor. Then we can often find a matrix table to tensor dimensionality reduction u of type (m0,…,mn) so that v−u is its own matrix table to tensor dimensionality reduction.
Observation 4: (Rational reduction) Suppose that v is sparse and the entries in v are all integers. Then the value ∥u−v∥2 is often a positive integer in both the case when u has only integer entries and in the case when u has non-integer entries.
Observation 5: (Multiple lines) Let m be a fixed positive even number. Suppose that v is sparse and the entries in v are all of the form r⋅e2πin/m for some integer n and r≥0. Then the entries in u are often exclusively of the form r⋅e2πin/m as well.
Observation 6: (Rational reductions) I have observed a sparse tensor v all of whose entries are integers along with matrix table to tensor dimensionality reductions u1,u2 of v where ∥v−u1∥=3,∥v−u1∥=2,∥u2−u1∥=5.
This is not an exclusive list of all the observations that I have made about the matrix table to tensor dimensionality reduction.
From these observations, one should conclude that the matrix table to tensor dimensionality reduction is a well-behaved machine learning algorithm. I hope and expect this machine learning algorithm and many similar ones to be used to both interpret the AI models that we have and will have and also to construct more interpretable and safer AI models in the future.
Suppose that q,r,d are natural numbers. Let 1<p<∞. Let zi,j be a complex number whenever 1≤i≤q,1≤j≤r. Let L:Md(C)r∖{0}→[−∞,∞) be the fitness function defined by letting L(X1,…,Xr)=(∑qi=1log(ρ(∑rj=1zi,jXj))/q)−log(∥∑rj=1XjX∗j∥p)/2. Here, ρ(X) denotes the spectral radius of a matrix X while ∥X∥p denotes the Schatten p-norm of X.
Now suppose that (A1,…,Ar)∈Md(C)r∖{0} is a tuple that maximizes L(A1,…,Ar). Let M:Cr∖{0}→[−∞,∞) be the fitness function defined by letting M(w1,…,wr)=log(ρ(w1A1+⋯+wrAr))−log(∥(w1,…,wr)∥2). Then suppose that (v1,…,vr)∈Cr∖{0} is a tuple that maximizes M(v1,…,vr). Then we will likely be able to find an ℓ∈{1,…,q} and a non-zero complex number α where (v1,…,vr)=α⋅(xℓ,1,…,xℓ,r).
In this case, (zi,j)i,j represents the training data while the matrices A1,…,Ar is our learned machine learning model. In this case, we are able to recover some original data values from the learned machine learning model A1,…,Ar without any distortion to the data values.
I have just made this observation, so I am still exploring the implications of this observation. But this is an example of how mathematical spectral machine learning algorithms can behave, and more mathematical machine learning models are more likely to be interpretable and they are more likely to have a robust mathematical/empirical theory behind them.
I think that all that happened here was the matrices A1,…,Ar just ended up being diagonal matrices. This means that this is probably an uninteresting observation in this case, but I need to do more tests before commenting any further.
Every entry in a matrix counts for the L2-spectral radius similarity. Suppose that A1,…,Ar,B1,…,Br are real n×n-matrices. Set A⊗2=A⊗A. Define the L2-spectral radius similarity between (A1,…,Ar) and (B1,…,Br) to be the number
ρ(A1⊗B1+⋯+Ar⊗Br)ρ(A⊗21+⋯+A⊗2r)1/2ρ(B⊗21+⋯+B⊗2r)1/2. Then the L2-spectral radius similarity is always a real number in the interval [0,1], so one can think of the L2-spectral radius similarity as a generalization of the value |⟨u,v⟩|∥u∥⋅∥v∥ where u,v are real or complex vectors. It turns out experimentally that if A1,…,Ar are random real matrices, and each Bj is obtained from Aj by replacing each entry in Bj with 0 with probability 1−α, then the L2-spectral radius similarity between (A1,…,Ar) and (B1,…,Br) will be about √α. If u=(A1,…,Ar),v=(B1,…,Br), then observe that |⟨u,v⟩|∥u∥⋅∥v∥≈√α as well.
Suppose now that A1,…,Ar are random real n×n matrices and C1,…,Cr are the m×m submatrices of A1,…,Ar respectively obtained by only looking at the first m rows and columns of A1,…,Ar. Then the L2-spectral radius similarity between A1,…,Ar and C1,…,Cr will be about √m/n. We can therefore conclude that in some sense C1,…,Cr is a simplified version of A1,…,Ar that more efficiently captures the behavior of A1,…,Ar than B1,…,Br does.
If A1,…,Ar,B1,…,Br are independent random matrices with standard Gaussian entries, then the L2-spectral radius similarity between (A1,…,Ar) and (B1,…,Br) will be about 1/√r with small variance. If u,v are random Gaussian vectors of length r, then |⟨u,v⟩|∥u∥⋅∥v∥ will on average be about c/√r for some constant c, but |⟨u,v⟩|∥u∥⋅∥v∥ will have a high variance.
These are some simple observations that I have made about the spectral radius during my research for evaluating cryptographic functions for cryptocurrency technologies.
Your notation is confusing me. If r is the size of the list of matrices, then how can you have a probability of 1-r for r>=2? Maybe you mean 1-1/r and sqrt{1/r} instead of 1-r and sqrt{r} respectively?
Thanks for pointing that out. I have corrected the typo. I simply used the symbol r for two different quantities, but now the probability is denoted by the symbol α.
I have originally developed a machine learning notion which I call an LSRDR (L2,d
-spectral radius dimensionality reduction), and LSRDRs (and similar machine learning models) behave mathematically and they have a high level of interpretability which should be good for AI safety. Here, I am giving an example of how LSRDRs behave mathematically and how one can get the most out of interpreting an LSRDR.
Suppose that n is a natural number. Let N denote the quantum channel that takes an n qubit quantum state and selects one of those qubits at random and send that qubit through the completely depolarizing channel (the completely depolarizing channel takes a state as input and returns the completely mixed state as an output).
If A1,…,Ar,B1,…,Br are complex matrices, then define superoperators Φ(A1,…,Ar) and Γ(A1,…,Ar;B1,…,Br) by setting
Φ(A1,…,Ar)(X)=∑rk=1AkXA∗k and Γ(A1,…,Ar;B1,…,Br)=∑rk=1AkXB∗k for all X.
Given tuples of matrices (A1,…,Ar),(B1,…,Br), define the L_2-spectral radius similarity between these tuples of matrices by setting
∥∥(A1,…,Ar)≃(B1,…,Br)∥2
=ρ(Γ(A1,…,Ar;B1,…,Br))ρ(Φ(A1,…,Ar))1/2ρ(Φ(B1,…,Br))1/2.
Suppose now that A1,…,A4n are matrices where N=Φ(A1,…,A4n). Let 1≤d≤n. We say that a tuple of complex d by d matrices (X1,…,X4n) is an LSRDR of A1,…,A4n if the quantity ∥(A1,…,A4n)≃(X1,…,X4n)∥2 is locally maximized.
Suppose now that X1,…,X4n are complex 2×2-matrices and (X1,…,X4n) is an LSRDR of A1,…,A4n. Then my computer experiments indicate that there will be some constant λ where λΓ(A1,…,A4n;X1,…,X4n) is similar to a positive semidefinite operator with eigenvalues {0,…,n+1} and where the eigenvalue j has multiplicity 3⋅C(n−1,k)+C(n−1,k−2) where C(⋅,⋅) denotes the binomial coefficient. I have not had a chance to try to mathematically prove this. Hooray. We have interpreted the LSRDR (X1,…,X4n) of (A1,…,A4n), and I have plenty of other examples of interpreted LSRDRs.
We also have a similar pattern for the spectrum of Φ(A1,…,A4n). My computer experiments indicate that there is some constant λ where λ⋅Φ(A1,…,A4n) has spectrum {0,…,n} where the eigenvalue j has multiplicity 3n−j⋅C(n,j).
The notion of the linear regression is an interesting machine learning algorithm in the sense that it can be studied mathematically, but the notion of a linear regression is a quite limited machine learning algorithm as most relations are non-linear. In particular, the linear regression does not give us any notion of any uncertainty in the output.
One way to extend the notion of the linear regression to encapsulate uncertainty in the outputs is to regress a function not to a linear transformation mapping vectors to vectors, but to regress the function to a transformation that maps vectors to mixed states. And the notion of a quantum channel is an appropriate transformation that maps vectors to mixed states. One can optimize this quantum channel using gradient ascent.
For this post, I will only go through some basic facts about quantum information theory. The reader is referred to the book The Theory of Quantum Information by John Watrous for all the missing details.
Let V be a complex Euclidean space. Let L(V) denote the vector space of linear operators from V to V. Given complex Euclidean spaces V,W, we say that a linear operator E from L(V) to L(W) is a trace preserving if Tr(E(X))=Tr(X)
for all X, and we say that E is completely positive if there are linear transformations A1,...,Ar where E(X)=A1XA∗1+⋯+ArXA∗r for all X; the value r is known as the Choi rank of E. A completely positive trace preserving operator is known as a quantum channel.
The collection of quantum channels from L(V) to L(W) is compact and convex.
If W is a complex Euclidean space, then let Dp(W)
denote the collection of pure states in W. Dp(W)
can be defined either as the set of unit vector in W modulo linear dependence, or Dp(W)
can be also defined as the collection of positive semidefinite rank-1 operators on W with trace 1.
Given complex Euclidean spaces U,V and a (multi) set of r distinct ordered pairs of unit vectors f={(u1,v1),…,(un,vn)}⊆U×V, and given a quantum channel
E:L(U)→L(V), we define the fitness level F(f,E)=∑rk=1log(E(uku∗k)vk,vk⟩ and the loss level L(f,E)=∑rk=1−log(E(uku∗k)vk,vk⟩.
We may locally optimize E to minimize its loss level using gradient descent, but there is a slight problem. The set of quantum channels spans the L(L(U),L(V)) which has dimension Dim(U)2⋅Dim(V)2. Due to the large dimension, any locally optimal E will contain Dim(U)2⋅Dim(V)2 many parameters, and this is a large quantity of parameters for what is supposed to be just a glorified version of a linear regression. Fortunately, instead of taking all quantum channels into consideration, we can limit the scope the quantum channels of limited Choi rank.
Empirical Observation: Suppose that U,V are complex Euclidean spaces, f⊆U×V is finite and r is a positive integer. Then computer experiments indicate that there is typically only one quantum channel E:L(U)→L(V) of Choi rank at most r where L(f,E) is locally minimized. More formally, if we run the experiment twice and produce two quantum channels E1,E2 where L(f,Ej) is locally minimized for j∈{1,2}, then we would typically have E1=E2. We therefore say that when L(f,E) is minimized, E is the best Choi rank r quantum channel approximation to f.
Suppose now that f={(u1,v1),…,(un,vn)}⊆Dp(U)×Dp(V) is a multiset. Then we would ideally like to approximate the function f better by alternating between the best Choi rank r quantum channel approximation and a non-linear mapping. An ideal choice of a non-linear but partial mapping is the function DE that maps a positive semidefinite matrix P to its (equivalence class of) unit dominant eigenvector.
Empirical observation: If f={(u1,v1),…,(un,vn)}⊆Dp(U)×Dp(V) and E is the best Choi rank r quantum channel approximation to f, then let u♯j=DE(E(uju∗j)) for all j, and define f♯={(u♯1,v1),…,(u♯n,vn)}. Let U be a small open neighborhood of f♯. Let g∈U. Then we typically have g♯♯=g♯. More generally, the best Choi rank r quantum channel approximation to g is typically the identity function.
From the above observation, we see that the vector u♯j is an approximation of vj that cannot be improved upon. The mapping DE∘E:Dp(U)→Dp(V) is therefore a trainable approximation to the mapping f and since Dp(U),Dp(V) are not even linear spaces (these are complex projective spaces with non-trivial homology groups), the mapping DE∘E is a non-linear model for the function to f.
I have been investigating machine learning models similar to DE∘E for cryptocurrency research and development as these sorts of machine learning models seem to be useful for evaluating the cryptographic security of some proof-of-work problems and other cryptographic functions like block ciphers and hash functions. I have seen other machine learning models that behave about as mathematically as DE∘E.
I admit that machine learning models like DE∘E are currently far from being as powerful as deep neural networks, but since DE∘E behaves mathematically, the model DE∘E should be considered as a safer and interpretable AI model. The goal is to therefore develop models that are mathematical like DE∘E but which can perform more and more machine learning tasks.
(Edited 8/14/2024)
Here is an example of what might happen. Suppose that for each uj, we select a orthonormal basis ej,1,…,ej,s of unit vectors for V. Let R={(uj,ej,k):1≤j≤n,1≤k≤s}. Then
Then for each quantum channel E, by the concavity of the logarithm function (which is the arithmetic-geometric mean inequality), we have
L(R,E)=∑nj=1∑nk=1−log(E(uju∗j)ej,k,ej,k⟩)
≤∑nj=1−log(∑nk=1⟨E(uju∗j)ej,k,ej,k⟩)
=∑nj=1−log(Tr(E)). Here, equality is reached if and only if
E(uju∗j)ej,k,ej,k⟩=E(uju∗j)ej,l,ej,l⟩ for each j,k,l, but this equality can be achieved by the channel
defined by E(X)=Tr(X)⋅I/s which is known as the completely depolarizing channel. This is the channel that always takes a quantum state and returns the completely mixed state. On the other hand, the channel E has maximum Choi rank since the Choi representation of E is just the identity function divided by the rank. This example is not unexpected since for each input of R the possible outputs span the entire space V evenly, so one does not have any information about the output from any particular input except that we know that the output could be anything. This example shows that the channels that locally minimize the loss function L(R,E) are the channels that give us a sort of linear regression of R but where this linear regression takes into consideration uncertainty in the output so the regression of a output of a state is a mixed state rather than a pure state.
We can use the L2−spectral radius similarity to measure more complicated similarities between data sets.
Suppose that A1,…,Ar are m×m-real matrices and B1,…,Br are n×n-real matrices. Let ρ(A) denote the spectral radius of A and let A⊗B denote the tensor product of A with B. Define the L2-spectral radius by setting ρ2(A1,…,Ar)=ρ(A1⊗A1+⋯+Ar⊗Ar)1/2, Define the L2-spectral radius similarity between A1,…,Ar and B1,…,Br as
∥(A1,…,Ar)≃(B1,…,Br)∥2=ρ(A1⊗B1+⋯+Ar⊗Br)ρ2(A1,…,Ar)ρ2(B1,…,Br).
We observe that if C is invertible and λ is a constant, then
∥(A1,…,Ar)≃(λCB1C−1,…,λCBrC−1)∥2=1.
Therefore, the L2-spectral radius is able to detect and measure symmetry that is normally hidden.
Example: Suppose that u1,…,ur;v1,…,vr are vectors of possibly different dimensions. Suppose that we would like to determine how close we are to obtaining an affine transformation T with T(uj)=vj for all j (or a slightly different notion of similarity). We first of all should normalize these vectors to obtain vectors x1,…,xr;y1,…,yr with mean zero and where the covariance matrix is the identity matrix (we may not need to do this depending on our notion of similarity). Then ∥(x1x∗1,…,xrx∗r)≃(y1y∗1,…,yry∗r)∥2 is a measure of low close we are to obtaining such an affine transformation T. We may be able to apply this notion to determining the distance between machine learning models. For example, suppose that M,N are both the first few layers in a (typically different) neural network. Suppose that a1,…,ar is a set of data points. Then if uj=M(aj) and vj=M(aj), then ∥(x1x∗1,…,xrx∗r)≃(y1y∗1,…,yry∗r)∥2 is a measure of the similarity between M and N.
I have actually used this example to see if there is any similarity between two different neural networks trained on the same data set. For my experiment, I chose a random collection of S⊆{0,1}32×{0,1}32 of ordered pairs and I trained the neural networks M,N to minimize the expected losses E(∥N(a)−b∥2:(a,b)∈S),E(∥M(a)−b∥2:(a,b)∈S). In my experiment, each aj was a random vector of length 32 whose entries were 0′s and 1′s. In my experiment, the similarity ∥(x1x∗1,…,xrx∗r)≃(y1y∗1,…,yry∗r)∥2 was worse than if x1,…,xr,y1,…,yr were just random vectors.
This simple experiment suggests that trained neural networks retain too much random or pseudorandom data and are way too messy in order for anyone to develop a good understanding or interpretation of these networks. In my personal opinion, neural networks should be avoided in favor of other AI systems, but we need to develop these alternative AI systems so that they eventually outperform neural networks. I have personally used the L2-spectral radius similarity to develop such non-messy AI systems including LSRDRs, but these non-neural non-messy AI systems currently do not perform as well as neural networks for most tasks. For example, I currently cannot train LSRDR-like structures to do any more NLP than just a word embedding, but I can train LSRDRs to do tasks that I have not seen neural networks perform (such as a tensor dimensionality reduction).
So in my research into machine learning algorithms that I can use to evaluate small block ciphers for cryptocurrency technologies, I have just stumbled upon a dimensionality reduction for tensors in tensor products of inner product spaces that according to my computer experiments exists, is unique, and which reduces a real tensor to another real tensor even when the underlying field is the field of complex numbers. I would not be too surprised if someone else came up with this tensor dimensionality reduction before since it has a rather simple description and it is in a sense a canonical tensor dimensionality reduction when we consider tensors as homogeneous non-commutative polynomials. But even if this tensor dimensionality reduction is not new, this dimensionality reduction algorithm belongs to a broader class of new algorithms that I have been studying recently such as LSRDRs.
Suppose that K is either the field of real numbers or the field of complex numbers. Let V1,…,Vn be finite dimensional inner product spaces over K with dimensions d1,…,dn respectively. Suppose that Vi has basis ei,1,…,ei,di. Given v∈V1⊗⋯⊗Vn, we would sometimes want to approximate the tensor v with a tensor that has less parameters. Suppose that (m0,…,mn) is a sequence of natural numbers with m0=mn=1. Suppose that Xi,j is a mi−1×mi matrix over the field K for 1≤i≤n and 1≤j≤di. From the system of matrices (Xi,j)i,j, we obtain a tensor T((Xi,j)i,j)=∑i1,…,inei1⊗⋯⊗ein⋅X1,i1…Xn,in. If the system of matrices (Xi,j)i,j locally minimizes the distance ∥v−T((Xi,j)i,j)∥, then the tensor T((Xi,j)i,j) is a dimensionality reduction of v which we shall denote by u.
Intuition: One can associate the tensor product V1⊗⋯⊗Vn with the set of all degree n homogeneous non-commutative polynomials that consist of linear combinations of the monomials of the form x1,i1⋯xn,in. Given, our matrices Xi,j, we can define a linear functional ϕ:V1⊗⋯⊗Vn→K by setting ϕ(p)=p((Xi,j)i,j). But by the Reisz representation theorem, the linear functional ϕ is dual to some tensor in V1⊗⋯⊗Vn. More specifically, ϕ is dual to T((Xi,j)i,j). The tensors of the form T((Xi,j)i,j) are therefore the
Advantages:
In my computer experiments, the reduced dimension tensor u is often (but not always) unique in the sense that if we calculate the tensor u twice, then we will get the same tensor. At least, the distribution of reduced dimension tensors u will have low Renyi entropy. I personally consider the partial uniqueness of the reduced dimension tensor to be advantageous over total uniqueness since this partial uniqueness signals whether one should use this tensor dimensionality reduction in the first place. If the reduced tensor is far from being unique, then one should not use this tensor dimensionality reduction algorithm. If the reduced tensor is unique or at least has low Renyi entropy, then this dimensionality reduction works well for the tensor v.
This dimensionality reduction does not depend on the choice of orthonormal basis ei,1,…,ei,di. If we chose a different basis for each Vi, then the resulting tensor u of reduced dimensionality will remain the same (the proof is given below).
If K is the field of complex numbers, but all the entries in the tensor v happen to be real numbers, then all the entries in the tensor u will also be real numbers.
This dimensionality reduction algorithm is intuitive when tensors are considered as homogeneous non-commutative polynomials.
Disadvantages:
This dimensionality reduction depends on a canonical cyclic ordering the inner product spaces V1,…,Vn.
Other notions of dimensionality reduction for tensors such as the CP tensor dimensionality reduction and the Tucker decompositions are more well-established, and they are obviously attempted generalizations of the singular value decomposition to higher dimensions, so they may be more intuitive to some.
The tensors of reduced dimensionality T((Xi,j)i,j) have a more complicated description than the tensors in the CP tensor dimensionality reduction.
Proposition: The set of tensors of the form ∑i1,…,ine1,i1⊗⋯⊗en,inX1,i1…Xn,in does not depend on the choice of bases (ei,1,…,ei,di)i.
Proof: For each i, let fi,1,…,fi,di be an alternative basis for Vi. Then suppose that ei,j=∑kui,j,kfi,k for each i,j. Then
∑i1,…,ine1,i1⊗⋯⊗en,inX1,i1…Xn,in
=∑i1,…,in∑k1u1,i1,k1f1,i1⊗⋯⊗∑knun,in,knfn,inX1,i1…Xn,in
=∑k1,…,knf1,k1⊗⋯⊗fn,kn∑i1,…,inu1,i1,k1…un,in,knX1,i1…Xn,i,n
=∑k1,…,knf1,k1⊗⋯⊗fn,kn(∑i1u1,i1,k1X1,i1)…(∑inun,in,knXin). Q.E.D.
A failed generalization: An astute reader may have observed that if we drop the requirement that mn=1, then we get a linear functional defined by letting
ϕ(p)=Tr(p((Xi,j)i,j)). This is indeed a linear functional, and we can try to approximate v using a the dual to ϕ, but this approach does not work as well.
In this note, I will continue to demonstrate not only the ways in which LSRDRs (L2,d-spectral radius dimensionality reduction) are mathematical but also how one can get the most out of LSRDRs. LSRDRs are one of the types of machine learning that I have been working on, and LSRDRs have characteristics that tell us that LSRDRs are often inherently interpretable which should be good for AI safety.
Suppose that N is the quantum channel that maps a n qubit state to a n qubit state where we select one of the 6 qubits at random and send it through the completely depolarizing channel (the completely depolarizing channel takes a state as an input and returns the completely mixed state as an output). Suppose that A1,…,A4n are 2n by 2n matrices where N has the Kraus representation N(X)=∑4nk=1AkXA∗k.
The objective is to locally maximize the fitness level ρ(∑4nk=1zkAk)/∥(z1,…,z4n)∥ where the norm in question is the Euclidean norm and where ρ denotes the spectral radius. This is a 1 dimensional case of an LSRDR of the channel N.
Let A=∑4nk=1zkAk when (z1,…,z4n) is selected to locally maximize the fitness level. Then my empirical calculations show that there is some λ where λ∑4nk=1zkAkis positive semidefinite with eigenvalues {0,…,n} and where the eigenvalue k has multiplicity (nk) which is the binomial coefficient. But these are empirical calculations for select values λ; I have not been able to mathematically prove that this is always the case for all local maxima for the fitness level (I have not tried to come up with a proof).
Here, we have obtained a complete characterization of A up-to-unitary equivalence due to the spectral theorem, so we are quite close to completely interpreting the local maximum for our fitness function.
I made a few YouTube videos showcasing the process of maximizing the fitness level here.
Spectra of 1 dimensional LSRDRs of 6 qubit noise channel during training
Spectra of 1 dimensional LSRDRs of 7 qubit noise channel during training
Spectra of 1 dimensional LSRDRs of 8 qubit noise channel during training
I will make another post soon about more LSRDRs of a higher dimension of the same channel N.
I personally like my machine learning algorithms to behave mathematically especially when I give them mathematical data. For example, a fitness function with apparently one local maximum value is a mathematical fitness function. It is even more mathematical if one can prove mathematical theorems about such a fitness function or if one can completely describe the local maxima of such a fitness function. It seems like fitness functions that satisfy these mathematical properties are more interpretable than the fitness functions which do not, so people should investigate such functions for AI safety purposes.
My notion of an LSRDR is a notion that satisfies these mathematical properties. To demonstrate the mathematical behavior of LSRDRs, let’s see what happens when we take an LSRDR of the octonions.
Let K denote either the field of real numbers or the field of complex numbers (K
could also be the division ring of quaternions, but for simplicity, let’s not go there). If A1,…,Ar are n×n-matrices over K, then an LSRDR (L2,d-spectral radius dimensionality reduction) of A1,…,Ar is a collection X1,…,Xr of d×d-matrices that locally maximizes the fitness level
ρ(A1⊗¯¯¯¯¯¯X1+⋯+Ar⊗¯¯¯¯¯¯Xr)ρ(X1⊗¯¯¯¯¯¯X1+⋯+Xr⊗¯¯¯¯¯¯Xr)1/2. ρ denotes the spectral radius function while ⊗ denotes the tensor product and ¯¯¯¯Z denotes the matrix obtained from Z by replacing each entry with its complex conjugate. We shall call the maximum fitness level the L2,d-spectral radius of A1,…,Ar over the field K, and we shall wrote ρK2,d(A1,…,Ar) for this spectral radius.
Define the linear superoperator Γ(A1,…,Ar;X1,…,Xr) by setting
Γ(A1,…,Ar;X1,…,Xr)(X)=A1XX∗1+⋯+ArXX∗r and set Φ(X1,…,Xr)=Γ(X1,…,Xr;X1,…,Xr). Then the fitness level of X1,…,Xr is ρ(Γ(A1,…,Ar;X1,…,Xr))Φ(X1,…,Xr)1/2.
Suppose that V is an 8-dimensional real inner product space. Then the octonionic multiplication operation is the unique up-to-isomorphism bilinear binary operation ∗ on V together with a unit 1 such that∥x∗y∥=∥x∥⋅∥y∥ and 1∗x=x∗1=1 for all x,y∈V. If we drop the condition that the octonions have a unit, then we do not quite have this uniqueness result.
We say that an octonion-like algbera is a 8-dimensional real inner product space V together with a unique up-to-isomorphism bilinear operation ∗ such that ∥x∗y∥=∥x∥⋅∥y∥ for all x,y.
Let V be a specific octonion-like algebra.
Suppose now that e1,…,e8 is an orthonormal basis for V (this does not need to be the standard basis). Then for each j∈{1,…,8}, let Aj be the linear operator from V to V defined by setting Ajv=ej∗v for all vectors v. All non-zero linear combinations of A1,…,A8 are conformal mappings (this means that they preserve angles). Now that we have turned the octonion-like algebra into matrices, we can take an LSRDR of the octonion-like algebras, but when taking the LSRDR of octonion-like algebras, we should not worry about the choice of orthonormal basis e1,…,e8 since I could formulate everything in a coordinate-free manner.
Empirical Observation from computer calculations: Suppose that 1≤d≤8 and K is the field of real numbers. Then the following are equivalent.
The d×d matrices X1,…,X8 are a LSRDR of A1,…,A8 over K where A1⊗X1+⋯+A8⊗X8 has a unique real dominant eigenvalue.
There exists matrices R,S where Xj=RAjS for all j and where SR is an orthonormal projection matrix.
In this case, ρK2,d(A1,…,A8)=√d and this fitness level is reached by the matrices X1,…,X8 in the above equivalent statements. Observe that the superoperator Γ(A1,…,A8;PA1P,…,PA8P) is similar to a direct sum of Γ(A1,…,Ar;X1,…,Xr)) and a zero matrix. But the projection matrix P is a dominant eigenvector of Γ(A1,…,A8;PA1P,…,PA8P) and ofΦ(PA1P,…,PA8P) as well.
I have no mathematical proof of the above fact though.
Now suppose that K=C. Then my computer calculations yield the following complex L2,d-spectral radii: (ρK2,j(A1,…,A8))8j=1
=(2,4,2+√8,5.4676355784...,6.1977259251...,4+√8,7.2628726081...,8)
Each time that I have trained a complex LSRDR of A1,…,A8, I was able to find a fitness level that is not just a local optimum but also a global optimum.
In the case of the real LSRDRs, I have a complete description of the LSRDRs of (A1,…,A8). This demonstrates that the octonion-like algebras are elegant mathematical structures and that LSRDRs behave mathematically in a manner that is compatible with the structure of the octonion-like algebras.
I have made a few YouTube videos that animate the process of gradient ascent to maximize the fitness level.
Edit: I have made some corrections to this post on 9/22/2024.
Fitness levels of complex LSRDRs of the octonions (youtube.com)
There are some cases where we have a complete description for the local optima for an optimization problem. This is a case of such an optimization problem.
Such optimization problems are useful for AI safety since a loss/fitness function where we have a complete description of all local or global optima is a highly interpretable loss/fitness function, and so one should consider using these loss/fitness functions to construct AI algorithms.
Theorem: Suppose that U is a real,complex, or quaternionic n×n-matrix that minimizes the quantity ∥U∥2+∥U−1∥2. Then U is unitary.
Proof: The real case is a special case of a complex case, and by representing each n×n-quaternionic matrix as a complex 2n×2n-matrix, we may assume that U is a complex matrix.
By the Schur decomposition, we know that U=VTV∗ where V is a unitary matrix and T is upper triangular. But we know that ∥U∥2=∥T∥2. Furthermore, U−1=VT−1V∗, so ∥U−1∥2=∥T−1∥2. Let D denote the diagonal matrix whose diagonal entries are the same as T. Then ∥T∥2≥∥D∥2 and ∥T−1∥2≥∥D−1∥2. Furthermore, ∥T∥2=∥D∥2 iff T is diagonal and ∥T−1∥2=∥D−1∥2 iff D is diagonal. Therefore, since ∥U∥2+∥U−1∥2=∥T∥2+∥T−1∥2 and ∥T∥2+∥T−1∥2 is minimized, we can conclude that T=D, so T is a diagonal matrix. Suppose that T has diagonal entries (z1,…,zn). By the arithmetic-geometric mean equality and the Cauchy-Schwarz inequality, we know that 12⋅(∥(z1,…,zn)∥2+∥(z−11,…,z−1n)∥2)≥∥(|z1|,…,|zn|)∥2⋅∥(|z−11|,…,|z−1n)|∥2
≥⟨(|z1|,…,|zn|),(|z−11|,…,|z−1n)|⟩=√n.
Here, the equalities hold if and only if |zj|=1 for all j, but this implies that U is unitary. Q.E.D.
The L2-spectral radius similarity is not transitive. Suppose that A1,…,Ar are m×m-matrices and B1,…,Br are real n×n-matrices. Then define ρ2(A1,…,Ar)=ρ(A1⊗A1+⋯+Ar⊗Ar)1/2. Then the generalized Cauchy-Schwarz inequality is satisfied:
ρ(A1⊗B1+⋯+Ar⊗Br)≤ρ2(A1,…,Ar)ρ2(B1,…,Br).
We therefore define the L2,d-spectral radius similarity between (A1,…,Ar) and (B1,…,Br) as ∥(A1,…,Ar)≃(B1,…,Br)∥=ρ(A1⊗B1+⋯+Ar⊗Br)ρ2(A1,…,Ar)ρ2(B1,…,Br). One should think of the L2-spectral radius similarity as a generalization of the cosine similarity ⟨u,v⟩∥u∥⋅∥v∥ between vectors u,v. I have been using the L2-spectral radius similarity to develop AI systems that seem to be very interpretable. The L2-spectral radius similarity is not transitive.
∥(A1,…,Ar)≃(A1⊕B1,…,Ar⊕Br)∥=1 and
∥(B1,…,Br)≃(A1⊕B1,…,Ar⊕Br)∥=1, but ∥(A1,…,Ar)≃(B1,…,Br)∥ can take any value in the interval [0,1].
We should therefore think of the L2,d-spectral radius similarity as a sort of least upper bound of [0,1]-valued equivalence relations than a [0,1]-valued equivalence relation. We need to consider this as a least upper bound because matrices have multiple dimensions.
Notation: ρ(A)=limn→∞∥An∥1/n is the spectral radius. The spectral radius A is the largest magnitude of an eigenvalue of the matrix A. Here the norm does not matter because we are taking the limit.A⊕B is the direct sum of matrices while A⊗B denotes the Kronecker product of matrices.
Let’s compute some inner products and gradients.
Set up: Let K denote either the field of real or the field of complex numbers. Suppose that d1,…,dr are positive integers. Let m0,…,mn be a sequence of positive integers with m0=mn=1. Suppose that Xi,j is an mi−1×mi-matrix whenever 1≤j≤di. Then from the matrices Xi,j, we can define a d1×⋯×dr-tensor T((Xi,j)i,j)=(X1,i1…Xn,in)i1,…,in. I have been doing computer experiments where I use this tensor to approximate other tensors by minimizing the ℓ2-distance. I have not seen this tensor approximation algorithm elsewhere, but perhaps someone else has produced this tensor approximation construction before. In previous shortform posts on this site, I have given evidence that the tensor dimensionality reduction behaves well, and in this post, we will focus on ways to compute with the tensors T((Xi,j)i,j), namely the inner product of such tensors and the gradient of the inner product with respect to the matrices (Xi,j)i,j.
Notation: If A1,…,Ar,B1,…,Br are matrices, then let Γ(A1,…,Ar;B1,…,Br) denote the superoperator defined by letting Γ(A1,…,Ar;B1,…,Br)(X)=A1XB∗1+⋯+ArXB∗r. Let Φ(A1,…,Ar)=Γ(A1,…,Ar;A1,…,Ar).
Inner product: Here is the computation of the inner product of our tensors.
⟨T((Ai,j)i,j),T((Bi,j)i,j)⟩
=⟨(A1,i1…An,in)i1,…,in,(B1,i1…Bn,in)i1,…,in⟩
=∑i1,…,inA1,i1…An,in(B1,i1…Bn,in)∗
=∑i1,…,inA1,i1…An,inB∗n,in…B∗1,i1
=Γ(A1,1,…,A1,d1;B1,1,…,B1,d1)…Γ(An,1,…,An,dn;Bn,1,…,Bn,dn).
In particular, ∥T((Ai,j)i,j)∥2=Φ(A1,1,…,A1,d1)…Φ(An,1,…,An,dn).
Gradient: Observe that ∇XTr(AX)=AT. We will see shortly that the cyclicity of the trace is useful for calculating the gradient. And here is my manual calculation of the gradient of the inner product of our tensors.
∇Xα,β⟨T((Xi,j)i,j),T((Ai,j)i,j)⟩
=∇Xα,β∑i1,…,inX1,i1…Xn,inA∗n,in…A∗1,i1
=∇Xα,βTr(∑i1,…,inX1,i1…Xn,inA∗n,in…A∗1,i1)
=∇Xα,βTr(∑i1,…,inXα,iα…Xn,inA∗n,in…
A∗α+1,iα+1A∗α,iαA∗α−1,iα−1…A∗1,i1X1,i1…Xα−1,iα−1)
=∇Xα,βTr(∑iα+1,…,in,i1,…,iα−1Xα,β…Xn,inA∗n,in…
A∗α+1,iα+1A∗α,βA∗α−1,iα−1…A∗1,i1X1,i1…Xα−1,iα−1)
=(∑iα+1,…,in,i1,…,iα−1Xα+1,iα+1…Xn,inA∗n,in…
A∗α+1,iα+1A∗α,βA∗α−1,iα−1…A∗1,i1X1,i1…Xα−1,iα−1)T
=(∑iα+1,…,in,i1,…,iα−1Xα+1,iα+1…Xn,in
A∗n,in…A∗α+1,iα+1A∗α,βA∗α−1,iα−1…A∗1,i1X1,i1…Xα−1,iα−1)T
=[(Γ(Xα+1,1,…,Xα+1,dα+1;Aα+1,1,…,Aα+1,dα+1)⋯
Γ(Xn,1,…,Xn,dn;An,1,…,An,dn)(1))
A∗α,β
((Γ(A∗α−1,1,…,A∗α−1,dα−1;X∗α−1,1,…,X∗α−1,dα−1)…
Γ(A∗1,1,…,A∗1,d1;X∗1,1,…,X∗1,d1)(1))]T.
So in my research into machine learning algorithms, I have stumbled upon a dimensionality reduction algorithm for tensors, and my computer experiments have so far yielded interesting results. I am not sure that this dimensionality reduction is new, but I plan on generalizing this dimensionality reduction to more complicated constructions that I am pretty sure are new and am confident would work well.
Suppose that K is either the field of real numbers or the field of complex numbers. Suppose that d1,…,dn are positive integers and (m0,…,mn) is a sequence of positive integers with m0=mn=1. Suppose that Xi,j is an mi−1×mi-matrix whenever 1≤j≤di. Then define a tensor T((Xi,j))=(X1,i1…Xn,in)i1,…,in∈Kd1⊗⋯⊗Kdn.
If v∈Kd1⊗⋯⊗Kdn, and (Xi,j)i,j is a system of matrices that minimizes the value ∥v−T((Xi,j))∥, then T((Xi,j)i,j) is a dimensionality reduction of (Xi,j)i,j, and we shall denote let u denote the tensor of reduced dimension T((Xi,j)i,j). We shall call u a matrix table to tensor dimensionality reduction of type (m0,…,mn).
Observation 1: (Sparsity) If v is sparse in the sense that most entries in the tensor v are zero, then the tensor u will tend to have plenty of zero entries, but as expected, u will be less sparse than v.
Observation 2: (Repeated entries) If v is sparse and v=(xi1,…,in)i1,…,in and the set {xi1,…,in:i1,…,in} has small cardinality, then the tensor u will contain plenty of repeated non-zero entries.
Observation 3: (Tensor decomposition) Let v be a tensor. Then we can often find a matrix table to tensor dimensionality reduction u of type (m0,…,mn) so that v−u is its own matrix table to tensor dimensionality reduction.
Observation 4: (Rational reduction) Suppose that v is sparse and the entries in v are all integers. Then the value ∥u−v∥2 is often a positive integer in both the case when u has only integer entries and in the case when u has non-integer entries.
Observation 5: (Multiple lines) Let m be a fixed positive even number. Suppose that v is sparse and the entries in v are all of the form r⋅e2πin/m for some integer n and r≥0. Then the entries in u are often exclusively of the form r⋅e2πin/m as well.
Observation 6: (Rational reductions) I have observed a sparse tensor v all of whose entries are integers along with matrix table to tensor dimensionality reductions u1,u2 of v where ∥v−u1∥=3,∥v−u1∥=2,∥u2−u1∥=5.
This is not an exclusive list of all the observations that I have made about the matrix table to tensor dimensionality reduction.
From these observations, one should conclude that the matrix table to tensor dimensionality reduction is a well-behaved machine learning algorithm. I hope and expect this machine learning algorithm and many similar ones to be used to both interpret the AI models that we have and will have and also to construct more interpretable and safer AI models in the future.
Suppose that q,r,d are natural numbers. Let 1<p<∞. Let zi,j be a complex number whenever 1≤i≤q,1≤j≤r. Let L:Md(C)r∖{0}→[−∞,∞) be the fitness function defined by letting L(X1,…,Xr)=(∑qi=1log(ρ(∑rj=1zi,jXj))/q)−log(∥∑rj=1XjX∗j∥p)/2. Here, ρ(X) denotes the spectral radius of a matrix X while ∥X∥p denotes the Schatten p-norm of X.
Now suppose that (A1,…,Ar)∈Md(C)r∖{0} is a tuple that maximizes L(A1,…,Ar). Let M:Cr∖{0}→[−∞,∞) be the fitness function defined by letting M(w1,…,wr)=log(ρ(w1A1+⋯+wrAr))−log(∥(w1,…,wr)∥2). Then suppose that (v1,…,vr)∈Cr∖{0} is a tuple that maximizes M(v1,…,vr). Then we will likely be able to find an ℓ∈{1,…,q} and a non-zero complex number α where (v1,…,vr)=α⋅(xℓ,1,…,xℓ,r).
In this case, (zi,j)i,j represents the training data while the matrices A1,…,Ar is our learned machine learning model. In this case, we are able to recover some original data values from the learned machine learning model A1,…,Ar without any distortion to the data values.
I have just made this observation, so I am still exploring the implications of this observation. But this is an example of how mathematical spectral machine learning algorithms can behave, and more mathematical machine learning models are more likely to be interpretable and they are more likely to have a robust mathematical/empirical theory behind them.
I think that all that happened here was the matrices A1,…,Ar just ended up being diagonal matrices. This means that this is probably an uninteresting observation in this case, but I need to do more tests before commenting any further.