Linear regression is an interesting machine learning algorithm in the sense that it can be studied mathematically, but it is also quite limited, since most relations are non-linear. In particular, linear regression does not give us any notion of uncertainty in the output.
One way to extend linear regression so that it captures uncertainty in the outputs is to regress a function not to a linear transformation mapping vectors to vectors, but to a transformation that maps vectors to mixed states, and a quantum channel is an appropriate transformation of this kind. One can then optimize the quantum channel using gradient ascent on a fitness function (equivalently, gradient descent on a loss function).
For this post, I will only go through some basic facts about quantum information theory. The reader is referred to the book The Theory of Quantum Information by John Watrous for all the missing details.
Let $V$ be a complex Euclidean space, and let $L(V)$ denote the vector space of linear operators from $V$ to $V$. Given complex Euclidean spaces $V,W$, we say that a linear operator $E$ from $L(V)$ to $L(W)$ is trace preserving if $\operatorname{Tr}(E(X))=\operatorname{Tr}(X)$ for all $X$, and we say that $E$ is completely positive if there are linear transformations $A_1,\dots,A_r:V\to W$ with $E(X)=A_1XA_1^*+\dots+A_rXA_r^*$ for all $X$; the minimal such $r$ is known as the Choi rank of $E$. A completely positive trace preserving operator is known as a quantum channel.
The collection of quantum channels from $L(V)$ to $L(W)$ is compact and convex.
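As a concrete illustration of these definitions, here is a minimal NumPy sketch that builds a random channel of Choi rank at most $r$ out of Kraus operators and checks trace preservation numerically. The helper functions and their names are purely illustrative, and the construction assumes $r\cdot\dim(W)\ge\dim(V)$ so that the stacked Kraus matrix can be an isometry.

```python
import numpy as np

def random_channel_kraus(dim_v, dim_w, r, seed=0):
    """Kraus operators A_1, ..., A_r : V -> W of a random channel of Choi rank <= r.

    Stacking the A_k into one (r*dim_w) x dim_v matrix B, trace preservation is
    the isometry condition B^* B = I, which a QR decomposition provides
    (this assumes r*dim_w >= dim_v)."""
    rng = np.random.default_rng(seed)
    b = rng.normal(size=(r * dim_w, dim_v)) + 1j * rng.normal(size=(r * dim_w, dim_v))
    q, _ = np.linalg.qr(b)                      # q^* q = I, i.e. sum_k A_k^* A_k = I
    return q.reshape(r, dim_w, dim_v)

def apply_channel(kraus, x):
    """E(X) = sum_k A_k X A_k^*."""
    return sum(a @ x @ a.conj().T for a in kraus)

# sanity check: the channel is trace preserving
kraus = random_channel_kraus(dim_v=3, dim_w=2, r=2, seed=1)
x = np.diag([0.2, 0.5, 0.3]).astype(complex)    # a density matrix on V
assert np.isclose(np.trace(apply_channel(kraus, x)), np.trace(x))
```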
If $W$ is a complex Euclidean space, then let $D_p(W)$ denote the collection of pure states in $W$. $D_p(W)$ can be defined either as the set of unit vectors in $W$ modulo linear dependence (i.e., identifying unit vectors that differ by a phase), or as the collection of positive semidefinite rank-1 operators on $W$ with trace 1.
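A tiny sketch of the correspondence between these two descriptions (illustrative code, not taken from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3) + 1j * rng.normal(size=3)
w = w / np.linalg.norm(w)                    # a unit vector in W
rho = np.outer(w, w.conj())                  # the corresponding rank-1 operator w w^*

# rho is positive semidefinite with trace 1, and multiplying w by a phase (a unit
# scalar) leaves rho unchanged, so rho depends only on the equivalence class of w
assert np.isclose(np.trace(rho).real, 1.0)
assert np.allclose(np.outer(np.exp(1j * 0.7) * w, (np.exp(1j * 0.7) * w).conj()), rho)
```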
Given complex Euclidean spaces $U,V$, a multiset of $n$ ordered pairs of unit vectors $f=\{(u_1,v_1),\dots,(u_n,v_n)\}\subseteq U\times V$, and a quantum channel
$E:L(U)\to L(V)$, we define the fitness level $F(f,E)=\sum_{k=1}^{n}\log\langle E(u_ku_k^*)v_k,v_k\rangle$ and the loss level $L(f,E)=\sum_{k=1}^{n}-\log\langle E(u_ku_k^*)v_k,v_k\rangle=-F(f,E)$.
We may locally optimize $E$ to minimize its loss level using gradient descent, but there is a slight problem. The set of quantum channels from $L(U)$ to $L(V)$ lives inside $L(L(U),L(V))$, which has dimension $\dim(U)^2\cdot\dim(V)^2$. Due to this large dimension, a locally optimal $E$ will in general be described by about $\dim(U)^2\cdot\dim(V)^2$ parameters, which is a large number of parameters for what is supposed to be just a glorified version of a linear regression. Fortunately, instead of taking all quantum channels into consideration, we can limit the scope to quantum channels of limited Choi rank.
Empirical Observation: Suppose that $U,V$ are complex Euclidean spaces, $f\subseteq U\times V$ is finite, and $r$ is a positive integer. Then computer experiments indicate that there is typically only one quantum channel $E:L(U)\to L(V)$ of Choi rank at most $r$ where $L(f,E)$ is locally minimized. More formally, if we run the experiment twice and produce two quantum channels $E_1,E_2$ where $L(f,E_j)$ is locally minimized for $j\in\{1,2\}$, then we would typically have $E_1=E_2$. We therefore say that when $L(f,E)$ is minimized, $E$ is the best Choi rank $r$ quantum channel approximation to $f$.
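Here is one way such an experiment could be set up, sketched in NumPy under my own parametrization choices: the channel is stored as Kraus operators $A_1,\dots,A_r$, a gradient step is taken on the loss, and the stacked Kraus matrix is re-projected onto the isometries (which is exactly the trace-preservation constraint $\sum_k A_k^*A_k=I$) using a singular value decomposition. The function names are illustrative, and this is only a sketch of the kind of computation involved, not a tuned implementation.

```python
import numpy as np

def init_kraus(dim_u, dim_v, r, rng):
    """Random Kraus operators A_1, ..., A_r : U -> V of a channel of Choi rank <= r
    (assumes r*dim_v >= dim_u so that the stacked Kraus matrix can be an isometry)."""
    b = rng.normal(size=(r * dim_v, dim_u)) + 1j * rng.normal(size=(r * dim_v, dim_u))
    q, _ = np.linalg.qr(b)                        # isometry: q^* q = I, i.e. sum_k A_k^* A_k = I
    return q.reshape(r, dim_v, dim_u)

def loss(kraus, us, vs):
    """L(f, E) = sum_k -log <E(u_k u_k^*) v_k, v_k> = sum_k -log sum_j |<v_k, A_j u_k>|^2."""
    return -sum(np.log(sum(abs(np.vdot(v, a @ u)) ** 2 for a in kraus)) for u, v in zip(us, vs))

def gradient(kraus, us, vs):
    """Wirtinger gradient of the loss with respect to each Kraus operator."""
    grads = np.zeros_like(kraus)
    for u, v in zip(us, vs):
        amps = np.array([np.vdot(v, a @ u) for a in kraus])      # <v_k, A_j u_k>
        p = np.sum(np.abs(amps) ** 2)
        for j in range(len(kraus)):
            grads[j] -= (amps[j] / p) * np.outer(v, u.conj())
    return grads

def project(kraus):
    """Map the stacked Kraus matrix to the nearest isometry (restores trace preservation)."""
    r, dim_v, dim_u = kraus.shape
    w, _, vh = np.linalg.svd(kraus.reshape(r * dim_v, dim_u), full_matrices=False)
    return (w @ vh).reshape(r, dim_v, dim_u)

def fit_channel(us, vs, dim_u, dim_v, r, steps=2000, lr=0.05, seed=0):
    """Projected gradient descent on the Kraus operators of a Choi-rank-<=r channel."""
    rng = np.random.default_rng(seed)
    kraus = init_kraus(dim_u, dim_v, r, rng)
    for _ in range(steps):
        kraus = project(kraus - lr * gradient(kraus, us, vs))
    return kraus
```

Running fit_channel from two different random seeds and comparing the resulting channels (for instance, by comparing their Choi matrices or their outputs on a few test states) is one way to probe the uniqueness described in the observation above.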
Suppose now that $f=\{(u_1,v_1),\dots,(u_n,v_n)\}\subseteq D_p(U)\times D_p(V)$ is a multiset. Then we would ideally like to approximate the function $f$ better by alternating between the best Choi rank $r$ quantum channel approximation and a non-linear mapping. An ideal choice for this non-linear (but partial) mapping is the function $\mathrm{DE}$ that maps a positive semidefinite matrix $P$ to the equivalence class of its unit dominant eigenvector.
Empirical observation: Suppose that $f=\{(u_1,v_1),\dots,(u_n,v_n)\}\subseteq D_p(U)\times D_p(V)$ and $E$ is the best Choi rank $r$ quantum channel approximation to $f$. Let $u_j^{\sharp}=\mathrm{DE}(E(u_ju_j^*))$ for all $j$, and define $f^{\sharp}=\{(u_1^{\sharp},v_1),\dots,(u_n^{\sharp},v_n)\}$. Let $N$ be a small open neighborhood of $f^{\sharp}$ and let $g\in N$. Then we typically have $g^{\sharp\sharp}=g^{\sharp}$. More generally, the best Choi rank $r$ quantum channel approximation to $g$ is typically the identity function.
From the above observation, we see that the vector $u_j^{\sharp}$ is an approximation of $v_j$ that cannot be improved upon. The mapping $\mathrm{DE}\circ E:D_p(U)\to D_p(V)$ is therefore a trainable approximation to the mapping $f$, and since $D_p(U),D_p(V)$ are not even linear spaces (they are complex projective spaces with non-trivial homology groups), the mapping $\mathrm{DE}\circ E$ is a non-linear model for the function $f$.
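Once a channel has been fitted, the mapping $\mathrm{DE}$ and the model $\mathrm{DE}\circ E$ are straightforward to write down. The following sketch (again mine, with kraus being a list or array of Kraus operators as in the sketches above) uses a Hermitian eigendecomposition to pick out the dominant eigenvector.

```python
import numpy as np

def dominant_eigenvector(p):
    """DE(P): a unit eigenvector for the largest eigenvalue of the positive semidefinite P.

    The vector is only defined up to a phase, and the map is genuinely partial:
    it is ambiguous when the top eigenvalue is degenerate."""
    eigvals, eigvecs = np.linalg.eigh(p)          # eigenvalues in ascending order for Hermitian p
    return eigvecs[:, -1]

def model(kraus, u):
    """The non-linear model DE o E: send the pure state u to DE(E(u u^*))."""
    p = sum(a @ np.outer(u, u.conj()) @ a.conj().T for a in kraus)
    return dominant_eigenvector(p)
```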
I have been investigating machine learning models similar to $\mathrm{DE}\circ E$ for cryptocurrency research and development, as these sorts of models seem to be useful for evaluating the cryptographic security of some proof-of-work problems and of other cryptographic functions like block ciphers and hash functions. I have seen other machine learning models that behave about as mathematically as $\mathrm{DE}\circ E$ does.
I admit that machine learning models like $\mathrm{DE}\circ E$ are currently far from being as powerful as deep neural networks, but since $\mathrm{DE}\circ E$ behaves mathematically, it should be considered a safer and more interpretable AI model. The goal is therefore to develop models that behave as mathematically as $\mathrm{DE}\circ E$ does but which can perform more and more machine learning tasks.
Here is an example of what might happen. Suppose that for each $u_j$ we select an orthonormal basis $e_{j,1},\dots,e_{j,s}$ of $V$, and let $R=\{(u_j,e_{j,k}):1\le j\le n,\,1\le k\le s\}$.
Then for each quantum channel $E$, by the concavity of the logarithm (equivalently, the arithmetic-geometric mean inequality) together with trace preservation, we have
$$L(R,E)=\sum_{j=1}^{n}\sum_{k=1}^{s}-\log\langle E(u_ju_j^*)e_{j,k},e_{j,k}\rangle$$
$$\ge\sum_{j=1}^{n}-s\log\Bigl(\tfrac{1}{s}\sum_{k=1}^{s}\langle E(u_ju_j^*)e_{j,k},e_{j,k}\rangle\Bigr)$$
$$=\sum_{j=1}^{n}-s\log\bigl(\operatorname{Tr}(E(u_ju_j^*))/s\bigr)=ns\log(s).$$
Here, equality is reached if and only if
$\langle E(u_ju_j^*)e_{j,k},e_{j,k}\rangle=\langle E(u_ju_j^*)e_{j,l},e_{j,l}\rangle$ for each $j,k,l$, and this equality is achieved by the channel
defined by $E(X)=\operatorname{Tr}(X)\cdot I/s$, which is known as the completely depolarizing channel. This is the channel that takes every quantum state to the completely mixed state. On the other hand, this channel has maximum Choi rank, since its Choi representation is the identity matrix divided by $s$. This example is not unexpected: for each input in $R$ the possible outputs span the entire space $V$ evenly, so a particular input gives no information about the output beyond the fact that the output could be anything. The example illustrates that the channels which locally minimize the loss $L(R,E)$ give us a sort of linear regression of $R$, but a regression that takes uncertainty in the output into account, so that the regression output for a state is a mixed state rather than a pure state.
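As a quick numerical check of this example (using, for simplicity, the standard basis of $V$ as the orthonormal basis for every $u_j$), the completely depolarizing channel gives $\langle E(u_ju_j^*)e_{j,k},e_{j,k}\rangle=1/s$ for every $j,k$, so the loss equals the lower bound $ns\log(s)$ derived above. The code below is my own illustration of that computation.

```python
import numpy as np

n, s, dim_u = 4, 3, 5
rng = np.random.default_rng(0)

# random pure-state inputs u_1, ..., u_n in U
us = [x / np.linalg.norm(x)
      for x in rng.normal(size=(n, dim_u)) + 1j * rng.normal(size=(n, dim_u))]

def depolarizing(x):
    """The completely depolarizing channel E(X) = Tr(X) * I / s."""
    return np.trace(x) * np.eye(s) / s

# an orthonormal basis e_{j,1}, ..., e_{j,s} of V for each u_j (here the standard basis)
basis = np.eye(s)
loss = -sum(np.log(np.vdot(basis[:, k], depolarizing(np.outer(u, u.conj())) @ basis[:, k]).real)
            for u in us for k in range(s))
assert np.isclose(loss, n * s * np.log(s))
```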