Linear regression is an interesting machine learning algorithm in the sense that it can be studied mathematically, but it is also quite limited, since most relations are non-linear. In particular, linear regression does not give us any notion of uncertainty in the output.
One way to extend linear regression so that it captures uncertainty in the outputs is to regress a function not to a linear transformation mapping vectors to vectors, but to a transformation that maps vectors to mixed states, and a quantum channel is an appropriate transformation of this kind. One can then optimize the quantum channel using gradient ascent on a fitness function (equivalently, gradient descent on a loss function).
For this post, I will only go through some basic facts about quantum information theory. The reader is referred to the book The Theory of Quantum Information by John Watrous for all the missing details.
Let $V$ be a complex Euclidean space, and let $L(V)$ denote the vector space of linear operators from $V$ to $V$. Given complex Euclidean spaces $V,W$, we say that a linear operator $E$ from $L(V)$ to $L(W)$ is trace preserving if $\operatorname{Tr}(E(X))=\operatorname{Tr}(X)$ for all $X$, and we say that $E$ is completely positive if there are linear transformations $A_1,\dots,A_r:V\to W$ with $E(X)=A_1XA_1^*+\dots+A_rXA_r^*$ for all $X$; the minimal such $r$ is known as the Choi rank of $E$. A completely positive trace preserving operator is known as a quantum channel.
The collection of quantum channels from $L(V)$ to $L(W)$ is compact and convex.
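As a concrete illustration of these definitions, here is a minimal NumPy sketch that builds a random channel of Choi rank at most $r$ out of Kraus operators and checks trace preservation numerically. The helper functions and their names are purely illustrative, and the construction assumes $r\cdot\dim(W)\ge\dim(V)$ so that the stacked Kraus matrix can be an isometry.

```python
import numpy as np

def random_channel_kraus(dim_v, dim_w, r, seed=0):
    """Kraus operators A_1, ..., A_r : V -> W of a random channel of Choi rank <= r.

    Stacking the A_k into one (r*dim_w) x dim_v matrix B, trace preservation is
    the isometry condition B^* B = I, which a QR decomposition provides
    (this assumes r*dim_w >= dim_v)."""
    rng = np.random.default_rng(seed)
    b = rng.normal(size=(r * dim_w, dim_v)) + 1j * rng.normal(size=(r * dim_w, dim_v))
    q, _ = np.linalg.qr(b)                      # q^* q = I, i.e. sum_k A_k^* A_k = I
    return q.reshape(r, dim_w, dim_v)

def apply_channel(kraus, x):
    """E(X) = sum_k A_k X A_k^*."""
    return sum(a @ x @ a.conj().T for a in kraus)

# sanity check: the channel is trace preserving
kraus = random_channel_kraus(dim_v=3, dim_w=2, r=2, seed=1)
x = np.diag([0.2, 0.5, 0.3]).astype(complex)    # a density matrix on V
assert np.isclose(np.trace(apply_channel(kraus, x)), np.trace(x))
```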
If $W$ is a complex Euclidean space, then let $D_p(W)$ denote the collection of pure states in $W$. $D_p(W)$ can be defined either as the set of unit vectors in $W$ modulo linear dependence (i.e., identifying unit vectors that differ by a phase), or as the collection of positive semidefinite rank-1 operators on $W$ with trace 1.
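A tiny sketch of the correspondence between these two descriptions (illustrative code, not taken from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3) + 1j * rng.normal(size=3)
w = w / np.linalg.norm(w)                    # a unit vector in W
rho = np.outer(w, w.conj())                  # the corresponding rank-1 operator w w^*

# rho is positive semidefinite with trace 1, and multiplying w by a phase (a unit
# scalar) leaves rho unchanged, so rho depends only on the equivalence class of w
assert np.isclose(np.trace(rho).real, 1.0)
assert np.allclose(np.outer(np.exp(1j * 0.7) * w, (np.exp(1j * 0.7) * w).conj()), rho)
```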
Given complex Euclidean spaces $U,V$, a multiset of $n$ ordered pairs of unit vectors $f=\{(u_1,v_1),\dots,(u_n,v_n)\}\subseteq U\times V$, and a quantum channel
$E:L(U)\to L(V)$, we define the fitness level $F(f,E)=\sum_{k=1}^{n}\log\langle E(u_ku_k^*)v_k,v_k\rangle$ and the loss level $L(f,E)=\sum_{k=1}^{n}-\log\langle E(u_ku_k^*)v_k,v_k\rangle=-F(f,E)$.
We may locally optimize $E$ to minimize its loss level using gradient descent, but there is a slight problem. The set of quantum channels from $L(U)$ to $L(V)$ lives inside $L(L(U),L(V))$, which has dimension $\dim(U)^2\cdot\dim(V)^2$. Due to this large dimension, a locally optimal $E$ will in general be described by about $\dim(U)^2\cdot\dim(V)^2$ parameters, which is a large number of parameters for what is supposed to be just a glorified version of a linear regression. Fortunately, instead of taking all quantum channels into consideration, we can limit the scope to quantum channels of limited Choi rank.
Empirical Observation: Suppose that $U,V$ are complex Euclidean spaces, $f\subseteq U\times V$ is finite, and $r$ is a positive integer. Then computer experiments indicate that there is typically only one quantum channel $E:L(U)\to L(V)$ of Choi rank at most $r$ where $L(f,E)$ is locally minimized. More formally, if we run the experiment twice and produce two quantum channels $E_1,E_2$ where $L(f,E_j)$ is locally minimized for $j\in\{1,2\}$, then we would typically have $E_1=E_2$. We therefore say that when $L(f,E)$ is minimized, $E$ is the best Choi rank $r$ quantum channel approximation to $f$.
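Here is one way such an experiment could be set up, sketched in NumPy under my own parametrization choices: the channel is stored as Kraus operators $A_1,\dots,A_r$, a gradient step is taken on the loss, and the stacked Kraus matrix is re-projected onto the isometries (which is exactly the trace-preservation constraint $\sum_k A_k^*A_k=I$) using a singular value decomposition. The function names are illustrative, and this is only a sketch of the kind of computation involved, not a tuned implementation.

```python
import numpy as np

def init_kraus(dim_u, dim_v, r, rng):
    """Random Kraus operators A_1, ..., A_r : U -> V of a channel of Choi rank <= r
    (assumes r*dim_v >= dim_u so that the stacked Kraus matrix can be an isometry)."""
    b = rng.normal(size=(r * dim_v, dim_u)) + 1j * rng.normal(size=(r * dim_v, dim_u))
    q, _ = np.linalg.qr(b)                        # isometry: q^* q = I, i.e. sum_k A_k^* A_k = I
    return q.reshape(r, dim_v, dim_u)

def loss(kraus, us, vs):
    """L(f, E) = sum_k -log <E(u_k u_k^*) v_k, v_k> = sum_k -log sum_j |<v_k, A_j u_k>|^2."""
    return -sum(np.log(sum(abs(np.vdot(v, a @ u)) ** 2 for a in kraus)) for u, v in zip(us, vs))

def gradient(kraus, us, vs):
    """Wirtinger gradient of the loss with respect to each Kraus operator."""
    grads = np.zeros_like(kraus)
    for u, v in zip(us, vs):
        amps = np.array([np.vdot(v, a @ u) for a in kraus])      # <v_k, A_j u_k>
        p = np.sum(np.abs(amps) ** 2)
        for j in range(len(kraus)):
            grads[j] -= (amps[j] / p) * np.outer(v, u.conj())
    return grads

def project(kraus):
    """Map the stacked Kraus matrix to the nearest isometry (restores trace preservation)."""
    r, dim_v, dim_u = kraus.shape
    w, _, vh = np.linalg.svd(kraus.reshape(r * dim_v, dim_u), full_matrices=False)
    return (w @ vh).reshape(r, dim_v, dim_u)

def fit_channel(us, vs, dim_u, dim_v, r, steps=2000, lr=0.05, seed=0):
    """Projected gradient descent on the Kraus operators of a Choi-rank-<=r channel."""
    rng = np.random.default_rng(seed)
    kraus = init_kraus(dim_u, dim_v, r, rng)
    for _ in range(steps):
        kraus = project(kraus - lr * gradient(kraus, us, vs))
    return kraus
```

Running fit_channel from two different random seeds and comparing the resulting channels (for instance, by comparing their Choi matrices or their outputs on a few test states) is one way to probe the uniqueness described in the observation above.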
Suppose now that $f=\{(u_1,v_1),\dots,(u_n,v_n)\}\subseteq D_p(U)\times D_p(V)$ is a multiset. Then we would ideally like to approximate the function $f$ better by alternating between the best Choi rank $r$ quantum channel approximation and a non-linear mapping. An ideal choice for this non-linear (but partial) mapping is the function $\mathrm{DE}$ that maps a positive semidefinite matrix $P$ to the equivalence class of its unit dominant eigenvector.
Empirical observation: Suppose that $f=\{(u_1,v_1),\dots,(u_n,v_n)\}\subseteq D_p(U)\times D_p(V)$ and $E$ is the best Choi rank $r$ quantum channel approximation to $f$. Let $u_j^{\sharp}=\mathrm{DE}(E(u_ju_j^*))$ for all $j$, and define $f^{\sharp}=\{(u_1^{\sharp},v_1),\dots,(u_n^{\sharp},v_n)\}$. Let $N$ be a small open neighborhood of $f^{\sharp}$ and let $g\in N$. Then we typically have $g^{\sharp\sharp}=g^{\sharp}$. More generally, the best Choi rank $r$ quantum channel approximation to $g$ is typically the identity function.
From the above observation, we see that the vector $u_j^{\sharp}$ is an approximation of $v_j$ that cannot be improved upon. The mapping $\mathrm{DE}\circ E:D_p(U)\to D_p(V)$ is therefore a trainable approximation to the mapping $f$, and since $D_p(U),D_p(V)$ are not even linear spaces (they are complex projective spaces with non-trivial homology groups), the mapping $\mathrm{DE}\circ E$ is a non-linear model for the function $f$.
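Once a channel has been fitted, the mapping $\mathrm{DE}$ and the model $\mathrm{DE}\circ E$ are straightforward to write down. The following sketch (again mine, with kraus being a list or array of Kraus operators as in the sketches above) uses a Hermitian eigendecomposition to pick out the dominant eigenvector.

```python
import numpy as np

def dominant_eigenvector(p):
    """DE(P): a unit eigenvector for the largest eigenvalue of the positive semidefinite P.

    The vector is only defined up to a phase, and the map is genuinely partial:
    it is ambiguous when the top eigenvalue is degenerate."""
    eigvals, eigvecs = np.linalg.eigh(p)          # eigenvalues in ascending order for Hermitian p
    return eigvecs[:, -1]

def model(kraus, u):
    """The non-linear model DE o E: send the pure state u to DE(E(u u^*))."""
    p = sum(a @ np.outer(u, u.conj()) @ a.conj().T for a in kraus)
    return dominant_eigenvector(p)
```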
I have been investigating machine learning models similar to $\mathrm{DE}\circ E$ for cryptocurrency research and development, as these sorts of models seem to be useful for evaluating the cryptographic security of some proof-of-work problems and of other cryptographic functions like block ciphers and hash functions. I have seen other machine learning models that behave about as mathematically as $\mathrm{DE}\circ E$ does.
I admit that machine learning models like $\mathrm{DE}\circ E$ are currently far from being as powerful as deep neural networks, but since $\mathrm{DE}\circ E$ behaves mathematically, it should be considered a safer and more interpretable AI model. The goal is therefore to develop models that behave as mathematically as $\mathrm{DE}\circ E$ does but which can perform more and more machine learning tasks.
Here is an example of what might happen. Suppose that for each $u_j$ we select an orthonormal basis $e_{j,1},\dots,e_{j,s}$ of $V$, and let $R=\{(u_j,e_{j,k}):1\le j\le n,\,1\le k\le s\}$.
Then for each quantum channel $E$, by the concavity of the logarithm (equivalently, the arithmetic-geometric mean inequality) together with trace preservation, we have
$$L(R,E)=\sum_{j=1}^{n}\sum_{k=1}^{s}-\log\langle E(u_ju_j^*)e_{j,k},e_{j,k}\rangle$$
$$\ge\sum_{j=1}^{n}-s\log\Bigl(\tfrac{1}{s}\sum_{k=1}^{s}\langle E(u_ju_j^*)e_{j,k},e_{j,k}\rangle\Bigr)$$
$$=\sum_{j=1}^{n}-s\log\bigl(\operatorname{Tr}(E(u_ju_j^*))/s\bigr)=ns\log(s).$$
Here, equality is reached if and only if
$\langle E(u_ju_j^*)e_{j,k},e_{j,k}\rangle=\langle E(u_ju_j^*)e_{j,l},e_{j,l}\rangle$ for each $j,k,l$, and this equality is achieved by the channel
defined by $E(X)=\operatorname{Tr}(X)\cdot I/s$, which is known as the completely depolarizing channel. This is the channel that takes every quantum state to the completely mixed state. On the other hand, this channel has maximum Choi rank, since its Choi representation is the identity matrix divided by $s$. This example is not unexpected: for each input in $R$ the possible outputs span the entire space $V$ evenly, so a particular input gives no information about the output beyond the fact that the output could be anything. The example illustrates that the channels which locally minimize the loss $L(R,E)$ give us a sort of linear regression of $R$, but a regression that takes uncertainty in the output into account, so that the regression output for a state is a mixed state rather than a pure state.
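As a quick numerical check of this example (using, for simplicity, the standard basis of $V$ as the orthonormal basis for every $u_j$), the completely depolarizing channel gives $\langle E(u_ju_j^*)e_{j,k},e_{j,k}\rangle=1/s$ for every $j,k$, so the loss equals the lower bound $ns\log(s)$ derived above. The code below is my own illustration of that computation.

```python
import numpy as np

n, s, dim_u = 4, 3, 5
rng = np.random.default_rng(0)

# random pure-state inputs u_1, ..., u_n in U
us = [x / np.linalg.norm(x)
      for x in rng.normal(size=(n, dim_u)) + 1j * rng.normal(size=(n, dim_u))]

def depolarizing(x):
    """The completely depolarizing channel E(X) = Tr(X) * I / s."""
    return np.trace(x) * np.eye(s) / s

# an orthonormal basis e_{j,1}, ..., e_{j,s} of V for each u_j (here the standard basis)
basis = np.eye(s)
loss = -sum(np.log(np.vdot(basis[:, k], depolarizing(np.outer(u, u.conj())) @ basis[:, k]).real)
            for u in us for k in range(s))
assert np.isclose(loss, n * s * np.log(s))
```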