Bilinear layers—not confident at all! It might make structure more amenable to mathematical analysis so it might help? But as yet there aren’t any empirical interpretability wins that have come from bilinear layers.
Dictionary learning—This is one of my main bets for comprehensive interpretability.
Set a random variable XA to be a trained model with bilinear layers with random initialization and training data A. Then I would like to know if various estimated upper bounds for various entropies for XA are much lower than if XA were a more typical machine learning model where a linear layer is composed with ReLU. It seems like entropy is a good objective measure of the lack of decipherability.
Now that I actually think about it, I have some ideas about how we can cluster neurons together if we are using bilinear layers. Because of this, I am starting to like bilinear layers a bit more, and I am feeling much more confident about the problem of interpreting neural networks as long as the neural networks have an infrastructure that is suitable for interpretability. I am going to explain everything in terms of real-valued mappings, but everything I say can be extended to complex and quaternionic matrices (but one needs to be a little bit more careful about conjugations,transposes, and adjoints, so I will leave the complex and quaternionic cases as an exercise to the reader).
Suppose that A1,…,Ar are n×n-real symmetric matrices. Then define a mapping fA1,…,Ar:Rn→Rr by setting fA1,…,Ar(x)=⟨A1x,x⟩,…,⟨Arx,x⟩.
Now, given a collection A1,…,Ar of n×n-real matrices, define a partial mapping LA1,…,Ar;d:Md(R)r→[0,∞) by setting LA1,…,Ar;d(X1,…,Xr)=ρ(A1⊗X1+⋯+Ar⊗Xr)ρ(X1⊗X1+⋯+Xr⊗Xr)1/2 where ρ denotes the spectral radius and ⊗ denotes the tensor product. Then we say that (X1,…,Xr)∈Md(R)r is a real L2,d-spectral radius dimensionality reduction (LSRDR) if LA1,…,Ar;d(X1,…,Xr) is locally maximized. One can compute LSRDRs using a variant gradient ascent combined with the power iteration technique for finding the dominant left and right eigenvectors and eigenvalues of A1⊗X1+⋯+Ar⊗Xr and X1⊗X1+⋯+Xr⊗Xr.
If X1,…,Xr is an LSRDR of A1,…,Ar, then you should be able to find real matrices R,S where Xj=RAjS for 1≤j≤r. Furthermore, there should be a constant α where RS=αId. We say that the LSRDR X1,…,Xr is normalized if α=1, so let’s assume that X1,…,Xr is a normalized LSRDR. Then define P=SR. Then P should be a (not-necessarily orthogonal, so P2=P but we could have P≠PT) projection matrix of rank d. If A1,…,Ar are all symmetric, then the matrix P should be an orthogonal projection. The vector space im(P) will be a cluster of neurons. We can also determine which elements of this cluster are most prominent.
Now, define a linear superoperator Γ(A1,…,Ar;X1,…,Xr):Mn,d(R)→Mn,d(R) by setting Γ(A1,…,Ar;X1,…,Xr)(X)=A1XXT1+⋯+ArXXTr and set Γ(A1,…,Ar;X1,…,Xr)T=Γ(AT1,…,ATr;XT1,…,XTr) which is the adjoint of Γ(A1,…,Ar;X1,…,Xr) where we endow Mn,d(R) with the Frobenius inner product. Let UR denote a dominant eigenvector of Γ(A1,…,Ar;X1,…,Xr) and let UL denote a dominant eigenvector ofΓ(A1,…,Ar;X1,…,Xr)T. Then after multiplying UR,UL by constant real factors, the matrices UR⋅ST,UL⋅R will be (typically distinct) positive definite trace 1 matrices of rank d with im(P)=im(UR⋅ST)=im(UL⋅R). If we retrained the LSRDR but with a different initialization, then the matrices P,UR⋅ST,UL⋅R will still remain the same.
If v∈Rn,∥v∥=1, then the values ⟨UR⋅STv,v⟩,⟨UL⋅Rv,v⟩ will be numbers in the in the interval [0,1] that measure how much the vector v belongs in the cluster.
If O is an r×r-orthogonal matrix, then the matrices P,UR⋅ST,UL⋅R will remain the same if they were trained on O∘fA1,…,Ar instead of fA1,…,Ar, so the matrices P,UR⋅ST,UL⋅R care about just the inner product space structure of Rr while ignoring any of the other structure of Rr. Let Pf=UR⋅ST,Qf=UL⋅R.
We can then use LSRDRs to compute the backpropagation of a cluster throughout the network.
Suppose that f=f1∘⋯∘fn where each fj is a bilinear layer. Then whenever fj:Rn→Rr is a bilinear mapping, and P∈Mr(R) is a positive semidefinite matrix that represents a cluster in Rr, the positive semidefinite matrices PP∘f,QP∘f represent clusters in Rn.
I have not compared LSRDRs to other techniques to other clustering and dimensionality reduction techniques such as higher order singular value decompositions, but I like LSRDRs since my computer calculations indicate that they are often unique.
A coordinate free perspective:
Suppose that V,W are real finite dimensional inner product spaces. Then we say that a function f:V→W is a quadratic form if for each g∈W∗, the mapping g∘f is a quadratic form. We say that a linear operator A:V→V⊗W is symmetric if for each w∈W∗, the operator (1V⊗w)A is symmetric. The quadratic forms f:V→W can be put into a canonical one-to-one correspondence with the symmetric linear operators A:V→V⊗W.
If A:U2→V2⊗W,B:U1→V1⊗W is an arbitrary linear operator, then define Γ(A,B):L(U1,U2)→L(V1,V2) by letting Γ(A,B)(X)=TrW(AXBT) where TrW denotes the partial trace.
Given a linear mapping A:V→V⊗W, and a d dimensional real inner product space U, define a partial mapping LA,U:L(U,U⊗W)→[0,∞) by setting LA,U(B)=ρ(Γ(A;B))ρ(Γ(B;B))1/2. We say that a linear mapping B:U→U⊗W is a real LSRDR of A if the value LA,U(B) is locally maximized. If B is a real LSRDR of A, one can as before (if everything goes right) find linear operators R,S and constant α where RS=α⋅1U and where B=(R⊗1W)AS. As before, we can normalize the LSRDR so that α=1. In this case, we can set UR to be a dominant eigenvector of Γ(A;B) and UL to be a dominant eigenvector of Γ(A;B)T. We still define P=SR and the mapping P will be a non-orthogonal projection, and UR⋅ST,UL⋅R will still be positive semidefinite (up-to a constant factor). The situation we are in is exactly as before except that we are working with abstract finite dimensional inner product spaces without any mention of coordinates.
I have thought of LSRDRs as machine learning models themselves (such as word embeddings), but it looks like LSRDRs may also be used to interpret machine learning models.
When generalizing bilinear layers to a quaternionic setting, do we want the layers to be linear in both variables or do we want them to be linear in one variable and anti-linear in the other variable?
How likely do you think bilinear layers & dictionary learning will lead to comprehensive interpretability?
Are there other specific areas you’re excited about?
Bilinear layers—not confident at all! It might make structure more amenable to mathematical analysis so it might help? But as yet there aren’t any empirical interpretability wins that have come from bilinear layers.
Dictionary learning—This is one of my main bets for comprehensive interpretability.
Other areas—I’m also generally excited by the line of research outlined in https://arxiv.org/abs/2301.04709
Set a random variable XA to be a trained model with bilinear layers with random initialization and training data A. Then I would like to know if various estimated upper bounds for various entropies for XA are much lower than if XA were a more typical machine learning model where a linear layer is composed with ReLU. It seems like entropy is a good objective measure of the lack of decipherability.
Now that I actually think about it, I have some ideas about how we can cluster neurons together if we are using bilinear layers. Because of this, I am starting to like bilinear layers a bit more, and I am feeling much more confident about the problem of interpreting neural networks as long as the neural networks have an infrastructure that is suitable for interpretability. I am going to explain everything in terms of real-valued mappings, but everything I say can be extended to complex and quaternionic matrices (but one needs to be a little bit more careful about conjugations,transposes, and adjoints, so I will leave the complex and quaternionic cases as an exercise to the reader).
Suppose that A1,…,Ar are n×n-real symmetric matrices. Then define a mapping fA1,…,Ar:Rn→Rr by setting fA1,…,Ar(x)=⟨A1x,x⟩,…,⟨Arx,x⟩.
Now, given a collection A1,…,Ar of n×n-real matrices, define a partial mapping LA1,…,Ar;d:Md(R)r→[0,∞) by setting LA1,…,Ar;d(X1,…,Xr)=ρ(A1⊗X1+⋯+Ar⊗Xr)ρ(X1⊗X1+⋯+Xr⊗Xr)1/2 where ρ denotes the spectral radius and ⊗ denotes the tensor product. Then we say that (X1,…,Xr)∈Md(R)r is a real L2,d-spectral radius dimensionality reduction (LSRDR) if LA1,…,Ar;d(X1,…,Xr) is locally maximized. One can compute LSRDRs using a variant gradient ascent combined with the power iteration technique for finding the dominant left and right eigenvectors and eigenvalues of A1⊗X1+⋯+Ar⊗Xr and X1⊗X1+⋯+Xr⊗Xr.
If X1,…,Xr is an LSRDR of A1,…,Ar, then you should be able to find real matrices R,S where Xj=RAjS for 1≤j≤r. Furthermore, there should be a constant α where RS=αId. We say that the LSRDR X1,…,Xr is normalized if α=1, so let’s assume that X1,…,Xr is a normalized LSRDR. Then define P=SR. Then P should be a (not-necessarily orthogonal, so P2=P but we could have P≠PT) projection matrix of rank d. If A1,…,Ar are all symmetric, then the matrix P should be an orthogonal projection. The vector space im(P) will be a cluster of neurons. We can also determine which elements of this cluster are most prominent.
Now, define a linear superoperator Γ(A1,…,Ar;X1,…,Xr):Mn,d(R)→Mn,d(R) by setting Γ(A1,…,Ar;X1,…,Xr)(X)=A1XXT1+⋯+ArXXTr and set Γ(A1,…,Ar;X1,…,Xr)T=Γ(AT1,…,ATr;XT1,…,XTr) which is the adjoint of Γ(A1,…,Ar;X1,…,Xr) where we endow Mn,d(R) with the Frobenius inner product. Let UR denote a dominant eigenvector of Γ(A1,…,Ar;X1,…,Xr) and let UL denote a dominant eigenvector ofΓ(A1,…,Ar;X1,…,Xr)T. Then after multiplying UR,UL by constant real factors, the matrices UR⋅ST,UL⋅R will be (typically distinct) positive definite trace 1 matrices of rank d with im(P)=im(UR⋅ST)=im(UL⋅R). If we retrained the LSRDR but with a different initialization, then the matrices P,UR⋅ST,UL⋅R will still remain the same.
If v∈Rn,∥v∥=1, then the values ⟨UR⋅STv,v⟩,⟨UL⋅Rv,v⟩ will be numbers in the in the interval [0,1] that measure how much the vector v belongs in the cluster.
If O is an r×r-orthogonal matrix, then the matrices P,UR⋅ST,UL⋅R will remain the same if they were trained on O∘fA1,…,Ar instead of fA1,…,Ar, so the matrices P,UR⋅ST,UL⋅R care about just the inner product space structure of Rr while ignoring any of the other structure of Rr. Let Pf=UR⋅ST,Qf=UL⋅R.
We can then use LSRDRs to compute the backpropagation of a cluster throughout the network.
Suppose that f=f1∘⋯∘fn where each fj is a bilinear layer. Then whenever fj:Rn→Rr is a bilinear mapping, and P∈Mr(R) is a positive semidefinite matrix that represents a cluster in Rr, the positive semidefinite matrices PP∘f,QP∘f represent clusters in Rn.
I have not compared LSRDRs to other techniques to other clustering and dimensionality reduction techniques such as higher order singular value decompositions, but I like LSRDRs since my computer calculations indicate that they are often unique.
A coordinate free perspective:
Suppose that V,W are real finite dimensional inner product spaces. Then we say that a function f:V→W is a quadratic form if for each g∈W∗, the mapping g∘f is a quadratic form. We say that a linear operator A:V→V⊗W is symmetric if for each w∈W∗, the operator (1V⊗w)A is symmetric. The quadratic forms f:V→W can be put into a canonical one-to-one correspondence with the symmetric linear operators A:V→V⊗W.
If A:U2→V2⊗W,B:U1→V1⊗W is an arbitrary linear operator, then define Γ(A,B):L(U1,U2)→L(V1,V2) by letting Γ(A,B)(X)=TrW(AXBT) where TrW denotes the partial trace.
Given a linear mapping A:V→V⊗W, and a d dimensional real inner product space U, define a partial mapping LA,U:L(U,U⊗W)→[0,∞) by setting LA,U(B)=ρ(Γ(A;B))ρ(Γ(B;B))1/2. We say that a linear mapping B:U→U⊗W is a real LSRDR of A if the value LA,U(B) is locally maximized. If B is a real LSRDR of A, one can as before (if everything goes right) find linear operators R,S and constant α where RS=α⋅1U and where B=(R⊗1W)AS. As before, we can normalize the LSRDR so that α=1. In this case, we can set UR to be a dominant eigenvector of Γ(A;B) and UL to be a dominant eigenvector of Γ(A;B)T. We still define P=SR and the mapping P will be a non-orthogonal projection, and UR⋅ST,UL⋅R will still be positive semidefinite (up-to a constant factor). The situation we are in is exactly as before except that we are working with abstract finite dimensional inner product spaces without any mention of coordinates.
Conclusions:
The information that I have given here can be found in several articles that I have posted at https://circcashcore.com/blog/.
I have thought of LSRDRs as machine learning models themselves (such as word embeddings), but it looks like LSRDRs may also be used to interpret machine learning models.
When generalizing bilinear layers to a quaternionic setting, do we want the layers to be linear in both variables or do we want them to be linear in one variable and anti-linear in the other variable?