This is a post about some of the machine learning algorithms that I have been experimenting with. These machine learning models behave quite mathematically, which seems to be very helpful for AI interpretability and AI safety.
Sequences of matrices generally cannot be approximated by sequences of Hermitian matrices.
Suppose that $A_1,\dots,A_r$ are $n\times n$-complex matrices and $B_1,\dots,B_r$ are $d\times d$-complex matrices. Then define a mapping $\Gamma(A_1,\dots,A_r;B_1,\dots,B_r):M_{n,d}(\mathbb{C})\rightarrow M_{n,d}(\mathbb{C})$ by $\Gamma(A_1,\dots,A_r;B_1,\dots,B_r)(X)=A_1XB_1^*+\dots+A_rXB_r^*$ for all $X$. Define $\Phi(A_1,\dots,A_r)=\Gamma(A_1,\dots,A_r;A_1,\dots,A_r)$. Define the $L_2$-spectral radius by setting $\rho_2(A_1,\dots,A_r)=\rho(\Phi(A_1,\dots,A_r))^{1/2}$, where $\rho$ denotes the spectral radius. Define the $L_{2,d}$-spectral radius similarity between $A_1,\dots,A_r$ and $B_1,\dots,B_r$ by
$$\frac{\rho(\Gamma(A_1,\dots,A_r;B_1,\dots,B_r))}{\rho_2(A_1,\dots,A_r)\cdot\rho_2(B_1,\dots,B_r)}.$$
The $L_{2,d}$-spectral radius similarity always lies in the interval $[0,1]$. If $n=d$, $A_1,\dots,A_r$ generates the algebra of $n\times n$-complex matrices, and $B_1,\dots,B_r$ also generates the algebra of $n\times n$-complex matrices, then the $L_{2,d}$-spectral radius similarity between $A_1,\dots,A_r$ and $B_1,\dots,B_r$ is exactly $1$ if and only if there are an invertible matrix $C$ and a non-zero constant $\lambda$ with $B_j=\lambda CA_jC^{-1}$ for all $j$.
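To make these definitions concrete, here is a minimal numerical sketch (this is not code from the post; the helper names gamma_matrix, spectral_radius, l2_spectral_radius, and l2d_similarity are my own) that computes the $L_{2,d}$-spectral radius similarity in NumPy by representing $\Gamma(A_1,\dots,A_r;B_1,\dots,B_r)$ as the matrix $\sum_j\overline{B_j}\otimes A_j$ acting on vectorized matrices.

```python
import numpy as np

def gamma_matrix(As, Bs):
    """Matrix of the linear map X -> A_1 X B_1^* + ... + A_r X B_r^* on vec(X)."""
    return sum(np.kron(B.conj(), A) for A, B in zip(As, Bs))

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def l2_spectral_radius(As):
    """rho_2(A_1,...,A_r) = rho(Gamma(A;A))^(1/2)."""
    return spectral_radius(gamma_matrix(As, As)) ** 0.5

def l2d_similarity(As, Bs):
    """L_{2,d}-spectral radius similarity; a number in [0, 1]."""
    return (spectral_radius(gamma_matrix(As, Bs))
            / (l2_spectral_radius(As) * l2_spectral_radius(Bs)))
```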
Now consider the supremum of
$$\frac{\rho(\Gamma(A_1,\dots,A_r;X_1,\dots,X_r))}{\rho_2(A_1,\dots,A_r)\cdot\rho_2(X_1,\dots,X_r)}$$
where $X_1,\dots,X_r$ range over the $d\times d$-Hermitian matrices.
One can get lower bounds for this supremum simply by locally maximizing the above quotient using gradient ascent, but if one locally maximizes this quantity twice, one typically gets the same fitness level.
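As a rough illustration of how such lower bounds could be obtained (this is my own sketch, not the setup used for the experiments, and it uses a generic derivative-free optimizer from SciPy in place of hand-written gradient ascent), one can parametrize $r$-tuples of $d\times d$-Hermitian matrices by real vectors and locally maximize the similarity computed by the l2d_similarity helper above:

```python
import numpy as np
from scipy.optimize import minimize

def params_to_hermitians(theta, r, d):
    """Unpack a real vector of length r*d*d into r Hermitian d x d matrices."""
    Ts = theta.reshape(r, d, d)
    return [(T + T.T) / 2 + 1j * (T - T.T) / 2 for T in Ts]

def hermitian_similarity_lower_bound(As, d, n_restarts=1, seed=0):
    """Locally maximize the similarity between As and tuples of d x d Hermitian matrices."""
    r, rng = len(As), np.random.default_rng(seed)
    best = -np.inf
    for _ in range(n_restarts):
        theta0 = rng.standard_normal(r * d * d)
        objective = lambda th: -l2d_similarity(As, params_to_hermitians(th, r, d))
        result = minimize(objective, theta0, method="Nelder-Mead",
                          options={"maxiter": 50000, "maxfev": 50000})
        best = max(best, -result.fun)
    return best
```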
Empirical observation/conjecture: If $A_1,\dots,A_r$ are $n\times n$-complex matrices, then the value of this supremum is the same whenever $d\geq n$.
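Under the assumptions of the sketches above (and with the caveat that a generic optimizer may stop short of a true local maximum), one crude way to probe this observation numerically is to compare the locally maximized similarity for $d=n$ against some $d>n$:

```python
# Hypothetical probe of the observation above; the parameters are arbitrary.
n, r = 4, 3
As = [np.random.randn(n, n) + 1j * np.random.randn(n, n) for _ in range(r)]
for d in (n, n + 2):
    # If the observation holds and the optimizer converges well, these values should agree.
    print(d, hermitian_similarity_lower_bound(As, d, n_restarts=3))
```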
The above observation means that sequences of $n\times n$-matrices are fundamentally non-Hermitian. In this case, we cannot get better models of $A_1,\dots,A_r$ using Hermitian matrices larger than the matrices $A_1,\dots,A_r$ themselves; I kind of want the behavior to be more complex instead of doing the same thing whenever $d\geq n$, but the purpose of modeling $A_1,\dots,A_r$ with Hermitian matrices is generally to use smaller matrices, not larger ones.
This means that the fitness function that we maximize here behaves mathematically.
Now, the model $(X_1,\dots,X_r)$ is a linear model of $(A_1,\dots,A_r)$ since the mapping $A_j\mapsto X_j$ is the restriction of a linear mapping, so such a linear model should be good for only a limited number of tasks, but the mathematical behavior of the model generalizes to multi-layered machine learning models.
In this post, we shall describe three related fitness functions with discrete domains for which the process of maximization is pseudodeterministic in the sense that if we locally maximize the fitness function multiple times, then we typically attain the same local maximum; this appears to be an important aspect of AI safety. These fitness functions are my own. While these functions are far from deep neural networks, I think they are still relevant to AI safety since they are closely related to other fitness functions that are locally maximized pseudodeterministically and that more closely resemble deep neural networks.
Let $K$ denote a finite dimensional algebra over the field of real numbers together with an adjoint operation $*$ (the operation $*$ is a linear involution with $(xy)^*=y^*x^*$). For example, $K$ could be the field of real numbers, the field of complex numbers, the division ring of quaternions, or a matrix ring over the reals, complex numbers, or quaternions. We can extend the adjoint $*$ to the matrix ring $M_r(K)$ by setting $(x_{i,j})_{i,j}^*=(x_{j,i}^*)_{i,j}$.
Let $n$ be a natural number. If $A_1,\dots,A_r\in M_n(K)$ and $X_1,\dots,X_r\in M_d(K)$, then define
$\Gamma(A_1,\dots,A_r;X_1,\dots,X_r):M_{n,d}(K)\rightarrow M_{n,d}(K)$ by setting $\Gamma(A_1,\dots,A_r;X_1,\dots,X_r)(X)=A_1XX_1^*+\dots+A_rXX_r^*$.
Suppose now that $1\leq d<n$. Then let $S_d\subseteq M_{n,n}(K)$ be the set of all $0,1$-diagonal matrices with exactly $d$ many $1$'s on the diagonal. We observe that each element of $S_d$ is an orthogonal projection. Define fitness functions $F_d,G_d,H_d:S_d\rightarrow\mathbb{R}$ by setting
$$F_d(P)=\rho(\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)),$$
$$G_d(P)=\rho(\Gamma(PA_1P,\dots,PA_rP;PA_1P,\dots,PA_rP)),\text{ and}$$
$$H_d(P)=\frac{F_d(P)^2}{G_d(P)}.$$
Here, $\rho$ denotes the spectral radius.
$F_d(P)$ is typically slightly larger than $G_d(P)$, so these three fitness functions are closely related.
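Here is a short sketch of the three fitness functions for the complex case $K=\mathbb{C}$ (my own helper names, reusing gamma_matrix and spectral_radius from the earlier sketch):

```python
def fitness_FGH(As, P):
    """F_d(P), G_d(P), H_d(P) for a 0,1-diagonal projection P with d ones."""
    PAs = [P @ A @ P for A in As]
    F = spectral_radius(gamma_matrix(As, PAs))   # F_d(P)
    G = spectral_radius(gamma_matrix(PAs, PAs))  # G_d(P)
    return F, G, F ** 2 / G                      # H_d(P) = F_d(P)^2 / G_d(P)
```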
If $P,Q\in S_d$, then we say that $Q$ is in the neighborhood of $P$ if $Q$ differs from $P$ in at most $2$ entries. If $F$ is a fitness function with domain $S_d$, then we say that $(P,F(P))$ is a local maximum of the function $F$ if $F(P)\geq F(Q)$ whenever $Q$ is in the neighborhood of $P$.
A path from initialization to a local maximum $(P_s,F(P_s))$ for $F$ will be a sequence $(P_0,\dots,P_s)$ where $P_0$ is generated uniformly at random, where $P_j$ is always in the neighborhood of $P_{j-1}$, and where $F(P_j)\geq F(P_{j-1})$ for all $j$; the length of such a path is $s$.
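A simple sketch of this local search for the complex case (my own function names; a projection in $S_d$ is represented by the set of indices of its $d$ ones, and a neighboring projection swaps one chosen index for an unchosen one, so it differs from the current one in at most $2$ diagonal entries):

```python
import random

def proj_from_indices(idx, n):
    """The 0,1-diagonal n x n projection with ones at the positions in idx."""
    P = np.zeros((n, n))
    P[list(idx), list(idx)] = 1.0
    return P

def climb(As, d, fitness, seed=None):
    """Greedy path from a uniformly random P_0 in S_d to a local maximum of `fitness`."""
    n = As[0].shape[0]
    idx = frozenset(random.Random(seed).sample(range(n), d))
    best = fitness(As, proj_from_indices(idx, n))
    while True:
        # All projections differing from the current one in at most 2 diagonal entries.
        neighbors = [(idx - {i}) | {j} for i in idx for j in range(n) if j not in idx]
        scored = [(fitness(As, proj_from_indices(c, n)), c) for c in neighbors]
        val, cand = max(scored, key=lambda t: t[0])
        if val <= best:          # no neighbor improves the fitness: local maximum reached
            return idx, best
        idx, best = cand, val
```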
Empirical observation: Suppose that $F\in\{F_d,G_d,H_d\}$. If we compute a path from initialization to a local maximum for $F$, then such a path will typically have length less than $n$. Furthermore, if we locally maximize $F$ multiple times, we will typically obtain the same local maximum each time. Moreover, if $P_F,P_G,P_H$ are the computed local maxima of $F_d,G_d,H_d$ respectively, then $P_F,P_G,P_H$ will either be identical or differ in relatively few diagonal entries.
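For example, with the sketches above (and arbitrary hypothetical parameters), one can run the local search twice and compare the computed local maxima:

```python
n, d, r = 12, 5, 3
As = [np.random.randn(n, n) + 1j * np.random.randn(n, n) for _ in range(r)]
F = lambda As_, P: fitness_FGH(As_, P)[0]   # the fitness function F_d
run1, _ = climb(As, d, F, seed=1)
run2, _ = climb(As, d, F, seed=2)
print(sorted(run1), sorted(run2))  # per the observation, typically the same set of indices
```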
I have not done the experiments yet, but one should be able to generalize the above empirical observation to matroids. Suppose that $M$ is the collection of bases of a matroid with underlying set $\{1,\dots,n\}$, so that $|A|=d$ for each $A\in M$; we identify each basis $A$ with the diagonal projection in $S_d$ whose $1$'s are at the positions in $A$. Then one should be able to make the same observation about the fitness functions $F_d|_M,G_d|_M,H_d|_M$ as well.
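A hypothetical sketch of such a restricted search (with $M$ supplied as a collection of $d$-element index sets and moves restricted to stay inside $M$, reusing the helpers above; this is only an illustration of the restriction, not something from the experiments):

```python
def climb_on_matroid(As, fitness, bases, seed=None):
    """Greedy local search restricted to the bases of a matroid.

    `bases` is a collection of d-element index sets (the bases of the matroid);
    moves are basis exchanges that stay inside `bases`.
    """
    n = As[0].shape[0]
    bases = [frozenset(A) for A in bases]
    idx = random.Random(seed).choice(bases)
    best = fitness(As, proj_from_indices(idx, n))
    while True:
        # Neighbors: bases differing from the current basis in at most 2 entries.
        neighbors = [B for B in bases if 0 < len(idx ^ B) <= 2]
        scored = [(fitness(As, proj_from_indices(B, n)), B) for B in neighbors]
        if not scored:
            return idx, best
        val, cand = max(scored, key=lambda t: t[0])
        if val <= best:
            return idx, best
        idx, best = cand, val
```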
We observe that the problems of maximizing $F_d,G_d,H_d$ are all NP-hard since the clique problem can be reduced to special cases of maximizing $F_d,G_d,H_d$. This means that maximizing $F_d,G_d,H_d$ can be a sophisticated problem, but it also means that we should not expect it to be easy to find the global maxima of $F_d,G_d,H_d$ in some cases.