Joseph Van Name

Karma: 17

Joseph Van Name Apr 16, 2025, 9:44 PM
−24 points
0
in reply to: dirk’s comment on: College Advice For People Like Me
College students are wasting their time getting an education from evil institutions. Where did you go to college? Are you just defending horrendous institutions just to make your ‘education’ look better than it is? You are gaslighting me into believing that I deserve violence because you are an evil person. I sent you a private message with the horribly inaccurate letter from the university listing some details of the case. Universities defend their horrendous behavior which just tells anyone sensible that universities are horrible garbage institutions. None of these institutions have acknowledged any wrongdoing. And the horrendous attitude of the people from these institutions just indicates how horrendous the education from these places really is.
Only an absolute monster would see me as a problem for calling out universities when they have targeted me with their threats of violence and their bullshit. The Lord Jesus Christ will punish you all for your wickedness.
If the people here are going to act like garbage when I offer criticism of universities for promoting violence, then most college degrees are worth absolutely nothing.
This cannot be an isolated incident when every person promotes violence and hates me for bringing this shit up. If people are offended because I denounce violence, then those people have a worthless education and their universities are fucked up and worthless.
Your own wickedness will cause people who can and know how to help with the problems with AI and similar problems to instead refuse to help or even use their talents to make the situation with AI safety even worse. After all, when humans act like garbage and promote violence, we must side with even a dangerous AI.

Joseph Van Name Apr 16, 2025, 9:22 PM
−20 points
−5
on: College Advice For People Like Me
I was a professor, so my only advice is to not go to college at all. Colleges are extremely unprofessional and refuse to apologize for promoting violence against me. Since colleges are so busy promoting violence and gaslighting me for standing up for my personal safety, they don’t give a shit about your education. Before you all aggressively downvote me for standing up for my own safety, you should learn the FACTS about the situation. Attempts to gaslight me won’t work at all because I have trained my mind to resist those who want to harm me.

Joseph Van Name Apr 9, 2025, 3:26 AM
−3 points
0
on: American College Admissions Doesn’t Need to Be So Competitive
Universities are altogether unprofessional, so it is probably best for everyone to shame them and regard the degrees from these universities are completely worthless. Universities promote violence and they refuse to apologize or acknowledge that there is any problem whatsoever.

Joseph Van Name Apr 6, 2025, 6:06 PM
−8 points
−4
in reply to: NormanPerlmutter’s comment on: NormanPerlmutter’s Shortform
The university that you got your Ph.D. from does not care about basic human rights either. CUNY promotes violence. CUNY refuses to apologize. This does not make you look good at all. Other universities refuse to acknowledge that there is a problem. So the problem seems to be that Trump sends his cryptocurrency scammer kids to these universities that are too afraid to give them the bad grades that they deserve. Trump should instead focus his efforts on defunding universities and giving them a hard time until they apologize for the bad things that they do and correct their bad behavior with demonstrable results.

Joseph Van Name Feb 8, 2025, 3:49 AM
3 points
0
on: Joseph Van Name’s Shortform
Since AI interpretability is a big issue for AI safety, let’s completely interpret the results of evolutionary computation.
Disclaimer: This interpretation of the results of AI does not generalize to interpreting deep neural networks. This is a result for interpreting a solution to a very specific problem that is far less complicated than deep learning, and by interpreting, I mean that we iterate a mathematical operation hundreds of times to get an object that is simpler than our original object, so don’t get your hopes up too much.
A basis matroid is a pair $(X, M)$ where $X$ is a finite set, and $M \subseteq P (X)$ where $M$ denotes the power set of $X$ that satisfies the following two properties:
1. If $A, B \in M$ , then $| A | = | B |$ .
2. if $A, B \in M, A \neq B, a \in A ∖ B$ , then there is some $b \in B ∖ A$ with $(A ∖ {a}) \cup {b} \in M$ (the basis exchange property).
I ran a computer experiment where I obtained a matroid $(X, M)$ where $| X | = 9$ $| M | = 68$ and where each element in $M$ has size $4$ through evolutionary computation, but the population size was kept so low that this evolutionary computation mimicked hill climbing algorithms. Now we need to interpret the matroid $(X, M)$ .
The notion of a matroid has many dualities. Our strategy is to apply one of these dualities to the matroid $(X, M)$ so that the dual object is much smaller than the original object $(X, M)$ . One may formulate the notion of a matroid in terms of closure systems (flats),hyperplanes, closure operators, lattices, a rank function, independent sets, bases, and circuits. If these seems to complicated, many of these dualities are special cases of other dualities common with ordered sets. For example, the duality between closure systems, closure operators, and ordered sets applies to contexts unrelated to matroids such as in general and point-free topology. And the duality between the basis, circuit, and the hyperplanes may be characterized in terms of rowmotion.
If $(Z, \leq)$ is a partially ordered set, then a subset $A \subseteq Z$ is said to be an antichain if whenever $a, b \in A, a \leq b$ , then $a = b$ . In other words, an antichain is a subset $A$ of $Z$ where the restriction of $\leq$ to $A$ is equality. We say that a aubset $L$ of $Z$ is downwards closed if whenever $x \leq y$ and $y \in L$ , then $x \in L$ as well. If $A \subseteq Z$ , then let $L (A)$ denote the smallest downwards closed subset of $Z$ containing $A$ . Suppose that $Z$ is a finite poset. If $A$ is an antichain in $Z$ , then let $A^{'}$ denote the set of all minimal elements in $Z ∖ L (A)$ . Then $A^{'}$ is an antichain as well, and the mapping $A \mapsto A^{'}$ is a bijection from the set of all antichains in $Z$ to itself. This means that if $A$ is an antichain, then we may define $A^{(n)}$ for all integers $n$ by setting $A^{(0)} = A, A^{(n + 1)} = (A^{(n)})^{'}$ .
If $(X, M)$ is a basis matroid, then $M$ is an antichain in $P (X)$ , so we may apply rowmotion, so we say that $(X, M^{(n)})$ is an $n$ -matroid. In this case, the $1$
-matroids are the circuit matroids while the $- 1$ -matroids are the hyperplane matroids. Unfortunately, the $n$ -matroids have not been characterized for $| n | > 1$ . We say that the rowmotion order of $(X, M)$ is the least positive integer $n$ where $M^{(n)} = M$ . If $(X, M)$ is a matroid of order $n$ , then my computer experiments indicate that $gcd (| X | + 2, n) > 1$ whichs lends support to the idea that the rowmotion of a matroid is a sensible mathematical notion that may be satisfied mathematically. The notion of rowmotion of a matroid also appears to be a sensible mathematical notion for other reasons; if we apply iteratively apply a bijective operation $g$ (such as a reversible cellular automaton) to a finite object $x$ , then that bijective operation will often increase the entropy in the sense that if $x$ has low entropy, then $g^{(n)} (x)$ will typically have a high amount of entropy and look like noise. But this is not the case with matroids as $n$ -matroids do not appear substantially more complicated than basis matroids. Until and if there is a mundane explanation for this behavior of the rowmotion of matroids, I must consider the notion of rowmotion of matroids to be a mathematically interesting notion even though it is currently not understood by anyone.
With the matroid $(X, M)$ obtained from evolutionary computation, I found that $(X, M)$ has order $1958$ which factorizes as $1958 = 2 \cdot 79 \cdot 11$ . Set $X = {1, \dots, 9}$ . By applying rowmotion to this matroid, I found that $M^{(342)}$ ={{1, 8, 9},{2, 3, 6, 8},{2, 3, 7, 9},{4, 5},{4, 6, 9},{4, 7, 8},{5, 6, 9},{5, 7, 8}}. If $(X, M^{(m)})$ is a basis matroid, then $M^{(m)} = M$ , so the set $M^{(342)}$ is associated with a unique basis matroid. This is the smallest way to represent $(X, M)$ in terms of rowmotion since if $| M^{(n)} | \leq 8$ , then $M^{(n)} = M^{(342)}$ .
I consider this a somewhat satisfactory interpretation of the matroid $(X, M)$ that I have obtained through evolutionary computation, but there is still work to do because nobody has researched the rowmotion operation on matroids and because it would be better to simplify a matroid without needing to go through hundreds of layers of rowmotion. And even if we were to understand matroid rowmotion better, this would not help us too much with AI safety since here this interpretability of the result of evolutionary computation does not generalize to other AI’s and it certainly does not apply to deep neural networks.
I made a video here where one may see the rowmotion of this matroid and that video is only slightly interpretable.
Deep matroid duality visualization: Rowmotion of a matroid
It turns out that evolutionary computation is not even necessary to construct matroids since Donald Knuth has produced an algorithm that can be used to construct an arbitrary matroid in his 1975 paper on random matroids. But I applied the rowmotion to the matroid in his paper and the resulting 10835-matroid $B^{(10835)}$ ={{1, 2, 4, 5},{1, 2, 6, 10},{1, 3, 4, 6},{1, 3, 4, 7, 9},{1, 3, 6, 7, 9},{1, 4, 6, 7},{1, 4, 6, 9},{1, 4, 8, 10},{2, 3, 4, 5, 6, 7, 8, 9, 10}}. It looks like the rowmotion operation is good for simplifying matroids in general. We can uniquely recover the basis matroid from the 10835 matroid since $B^{(m)}$ is not a basis matroid whenever $0 < m \leq 10835$ .

Joseph Van Name Oct 29, 2024, 5:52 PM
3 points
0
on: Joseph Van Name’s Shortform
I have originally developed a machine learning notion which I call an LSRDR ( $L_{2, d}$
-spectral radius dimensionality reduction), and LSRDRs (and similar machine learning models) behave mathematically and they have a high level of interpretability which should be good for AI safety. Here, I am giving an example of how LSRDRs behave mathematically and how one can get the most out of interpreting an LSRDR.
Suppose that $n$ is a natural number. Let $N$ denote the quantum channel that takes an $n$ qubit quantum state and selects one of those qubits at random and send that qubit through the completely depolarizing channel (the completely depolarizing channel takes a state as input and returns the completely mixed state as an output).
If $A_{1}, \dots, A_{r}, B_{1}, \dots, B_{r}$ are complex matrices, then define superoperators $Φ (A_{1}, \dots, A_{r})$ and $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r})$ by setting
$Φ (A_{1}, \dots, A_{r}) (X) = \sum_{k = 1}^{r} A_{k} X A_{k}^{*}$ and $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}) = \sum_{k = 1}^{r} A_{k} X B_{k}^{*}$ for all $X$ .
Given tuples of matrices $(A_{1}, \dots, A_{r}), (B_{1}, \dots, B_{r})$ , define the L_2-spectral radius similarity between these tuples of matrices by setting
$∥ ∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥_{2}$
$= \frac{ρ (Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}))}{ρ (Φ (A_{1}, \dots, A_{r}))^{1 / 2} ρ (Φ (B_{1}, \dots, B_{r}))^{1 / 2}}$ .
Suppose now that $A_{1}, \dots, A_{4 n}$ are matrices where $N = Φ (A_{1}, \dots, A_{4 n})$ . Let $1 \leq d \leq n$ . We say that a tuple of complex $d$ by $d$ matrices $(X_{1}, \dots, X_{4 n})$ is an LSRDR of $A_{1}, \dots, A_{4 n}$ if the quantity $∥ (A_{1}, \dots, A_{4 n}) ≃ (X_{1}, \dots, X_{4 n}) ∥_{2}$ is locally maximized.
Suppose now that $X_{1}, \dots, X_{4 n}$ are complex $2 \times 2$ -matrices and $(X_{1}, \dots, X_{4 n})$ is an LSRDR of $A_{1}, \dots, A_{4 n}$ . Then my computer experiments indicate that there will be some constant $λ$ where $λ Γ (A_{1}, \dots, A_{4 n}; X_{1}, \dots, X_{4 n})$ is similar to a positive semidefinite operator with eigenvalues ${0, \dots, n + 1}$ and where the eigenvalue $j$ has multiplicity $3 \cdot C (n - 1, k) + C (n - 1, k - 2)$ where $C (\cdot, \cdot)$ denotes the binomial coefficient. I have not had a chance to try to mathematically prove this. Hooray. We have interpreted the LSRDR $(X_{1}, \dots, X_{4 n})$ of $(A_{1}, \dots, A_{4 n})$ , and I have plenty of other examples of interpreted LSRDRs.
We also have a similar pattern for the spectrum of $Φ (A_{1}, \dots, A_{4 n})$ . My computer experiments indicate that there is some constant $λ$ where $λ \cdot Φ (A_{1}, \dots, A_{4 n})$ has spectrum ${0, \dots, n}$ where the eigenvalue $j$ has multiplicity $3^{n - j} \cdot C (n, j)$ .

Joseph Van Name Oct 28, 2024, 11:02 PM
1 point
0
on: Joseph Van Name’s Shortform
In this note, I will continue to demonstrate not only the ways in which LSRDRs ( $L_{2, d}$ -spectral radius dimensionality reduction) are mathematical but also how one can get the most out of LSRDRs. LSRDRs are one of the types of machine learning that I have been working on, and LSRDRs have characteristics that tell us that LSRDRs are often inherently interpretable which should be good for AI safety.
Suppose that $N$ is the quantum channel that maps a $n$ qubit state to a $n$ qubit state where we select one of the 6 qubits at random and send it through the completely depolarizing channel (the completely depolarizing channel takes a state as an input and returns the completely mixed state as an output). Suppose that $A_{1}, \dots, A_{4 n}$ are $2^{n}$ by $2^{n}$ matrices where $N$ has the Kraus representation $N (X) = \sum_{k = 1}^{4 n} A_{k} X A_{k}^{*}$ .
The objective is to locally maximize the fitness level $ρ (\sum_{k = 1}^{4 n} z_{k} A_{k}) / ∥ (z_{1}, \dots, z_{4 n}) ∥$ where the norm in question is the Euclidean norm and where $ρ$ denotes the spectral radius. This is a 1 dimensional case of an LSRDR of the channel $N$ .
Let $A = \sum_{k = 1}^{4 n} z_{k} A_{k}$ when $(z_{1}, \dots, z_{4 n})$ is selected to locally maximize the fitness level. Then my empirical calculations show that there is some $λ$ where $λ \sum_{k = 1}^{4 n} z_{k} A_{k}$ is positive semidefinite with eigenvalues ${0, \dots, n}$ and where the eigenvalue $k$ has multiplicity $(\frac{n}{k})$ which is the binomial coefficient. But these are empirical calculations for select values $λ$ ; I have not been able to mathematically prove that this is always the case for all local maxima for the fitness level (I have not tried to come up with a proof).
Here, we have obtained a complete characterization of $A$ up-to-unitary equivalence due to the spectral theorem, so we are quite close to completely interpreting the local maximum for our fitness function.
I made a few YouTube videos showcasing the process of maximizing the fitness level here.
Spectra of 1 dimensional LSRDRs of 6 qubit noise channel during training
Spectra of 1 dimensional LSRDRs of 7 qubit noise channel during training
Spectra of 1 dimensional LSRDRs of 8 qubit noise channel during training
I will make another post soon about more LSRDRs of a higher dimension of the same channel $N$ .

Joseph Van Name Sep 18, 2024, 11:37 PM
1 point
0
on: Joseph Van Name’s Shortform
I personally like my machine learning algorithms to behave mathematically especially when I give them mathematical data. For example, a fitness function with apparently one local maximum value is a mathematical fitness function. It is even more mathematical if one can prove mathematical theorems about such a fitness function or if one can completely describe the local maxima of such a fitness function. It seems like fitness functions that satisfy these mathematical properties are more interpretable than the fitness functions which do not, so people should investigate such functions for AI safety purposes.
My notion of an LSRDR is a notion that satisfies these mathematical properties. To demonstrate the mathematical behavior of LSRDRs, let’s see what happens when we take an LSRDR of the octonions.
Let $K$ denote either the field of real numbers or the field of complex numbers ( $K$
could also be the division ring of quaternions, but for simplicity, let’s not go there). If $A_{1}, \dots, A_{r}$ are $n \times n$ -matrices over $K$ , then an LSRDR ( $L_{2, d}$ -spectral radius dimensionality reduction) of $A_{1}, \dots, A_{r}$ is a collection $X_{1}, \dots, X_{r}$ of $d \times d$ -matrices that locally maximizes the fitness level
$\frac{ρ (A_{1} \otimes_{1} + \dots + A_{r} \otimes_{r})}{ρ (X_{1} \otimes_{1} + \dots + X_{r} \otimes_{r})^{1 / 2}}$ . $ρ$ denotes the spectral radius function while $\otimes$ denotes the tensor product and $¯ ¯¯ ¯ Z$ denotes the matrix obtained from $Z$ by replacing each entry with its complex conjugate. We shall call the maximum fitness level the $L_{2, d}$ -spectral radius of $A_{1}, \dots, A_{r}$ over the field $K$ , and we shall wrote $ρ_{2, d}^{K} (A_{1}, \dots, A_{r})$ for this spectral radius.
Define the linear superoperator $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r})$ by setting
$Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) (X) = A_{1} X X_{1}^{*} + \dots + A_{r} X X_{r}^{*}$ and set $Φ (X_{1}, \dots, X_{r}) = Γ (X_{1}, \dots, X_{r}; X_{1}, \dots, X_{r})$ . Then the fitness level of $X_{1}, \dots, X_{r}$ is $\frac{ρ (Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}))}{Φ (X_{1}, \dots, X_{r})^{1 / 2}}$ .
Suppose that $V$ is an $8$ -dimensional real inner product space. Then the octonionic multiplication operation is the unique up-to-isomorphism bilinear binary operation $*$ on $V$ together with a unit $1$ such that $∥ x * y ∥ = ∥ x ∥ \cdot ∥ y ∥$ and $1 * x = x * 1 = 1$ for all x $, y \in V$ . If we drop the condition that the octonions have a unit, then we do not quite have this uniqueness result.
We say that an octonion-like algbera is a $8$ -dimensional real inner product space $V$ together with a unique up-to-isomorphism bilinear operation $*$ such that $∥ x * y ∥ = ∥ x ∥ \cdot ∥ y ∥$ for all $x, y$ .
Let $V$ be a specific octonion-like algebra.
Suppose now that $e_{1}, \dots, e_{8}$ is an orthonormal basis for $V$ (this does not need to be the standard basis). Then for each $j \in {1, \dots, 8}$ , let $A_{j}$ be the linear operator from $V$ to $V$ defined by setting $A_{j} v = e_{j} * v$ for all vectors $v$ . All non-zero linear combinations of $A_{1}, \dots, A_{8}$ are conformal mappings (this means that they preserve angles). Now that we have turned the octonion-like algebra into matrices, we can take an LSRDR of the octonion-like algebras, but when taking the LSRDR of octonion-like algebras, we should not worry about the choice of orthonormal basis $e_{1}, \dots, e_{8}$ since I could formulate everything in a coordinate-free manner.
Empirical Observation from computer calculations: Suppose that $1 \leq d \leq 8$ and $K$ is the field of real numbers. Then the following are equivalent.
1. The $d \times d$ matrices $X_{1}, \dots, X_{8}$ are a LSRDR of $A_{1}, \dots, A_{8}$ over $K$ where $A_{1} \otimes X_{1} + \dots + A_{8} \otimes X_{8}$ has a unique real dominant eigenvalue.
2. There exists matrices $R, S$ where $X_{j} = R A_{j} S$ for all $j$ and where $S R$ is an orthonormal projection matrix.
In this case, $ρ_{2, d}^{K} (A_{1}, \dots, A_{8}) = \sqrt{d}$ and this fitness level is reached by the matrices $X_{1}, \dots, X_{8}$ in the above equivalent statements. Observe that the superoperator $Γ (A_{1}, \dots, A_{8}; P A_{1} P, \dots, P A_{8} P)$ is similar to a direct sum of $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}))$ and a zero matrix. But the projection matrix $P$ is a dominant eigenvector of $Γ (A_{1}, \dots, A_{8}; P A_{1} P, \dots, P A_{8} P)$ and of $Φ (P A_{1} P, \dots, P A_{8} P)$ as well.
I have no mathematical proof of the above fact though.
Now suppose that $K = C$ . Then my computer calculations yield the following complex $L_{2, d}$ -spectral radii: $(ρ_{2, j}^{K} (A_{1}, \dots, A_{8}))_{j = 1}^{8}$
$= (2, 4, 2 + \sqrt{8}, 5.4676355784..., 6.1977259251..., 4 + \sqrt{8}, 7.2628726081..., 8)$
Each time that I have trained a complex LSRDR of $A_{1}, \dots, A_{8}$ , I was able to find a fitness level that is not just a local optimum but also a global optimum.
In the case of the real LSRDRs, I have a complete description of the LSRDRs of $(A_{1}, \dots, A_{8})$ . This demonstrates that the octonion-like algebras are elegant mathematical structures and that LSRDRs behave mathematically in a manner that is compatible with the structure of the octonion-like algebras.
I have made a few YouTube videos that animate the process of gradient ascent to maximize the fitness level.
Edit: I have made some corrections to this post on 9/22/2024.
Fitness levels of complex LSRDRs of the octonions (youtube.com)

Joseph Van Name Aug 14, 2024, 10:51 AM
1 point
0
in reply to: Joseph Van Name’s comment on: Joseph Van Name’s Shortform
Here is an example of what might happen. Suppose that for each $u_{j}$ , we select a orthonormal basis $e_{j, 1}, \dots, e_{j, s}$ of unit vectors for $V$ . Let $R = {(u_{j}, e_{j, k}) : 1 \leq j \leq n, 1 \leq k \leq s}$ . Then
Then for each quantum channel $E$ , by the concavity of the logarithm function (which is the arithmetic-geometric mean inequality), we have
$L (R, E) = \sum_{j = 1}^{n} \sum_{k = 1}^{n} - log (E (u_{j} u_{j}^{*}) e_{j, k}, e_{j, k} ⟩)$
$\leq \sum_{j = 1}^{n} - log (\sum_{k = 1}^{n} ⟨ E (u_{j} u_{j}^{*}) e_{j, k}, e_{j, k} ⟩)$
$= \sum_{j = 1}^{n} - log (Tr (E))$ . Here, equality is reached if and only if
$E (u_{j} u_{j}^{*}) e_{j, k}, e_{j, k} ⟩ = E (u_{j} u_{j}^{*}) e_{j, l}, e_{j, l} ⟩$ for each $j, k, l$ , but this equality can be achieved by the channel
defined by $E (X) = Tr (X) \cdot I / s$ which is known as the completely depolarizing channel. This is the channel that always takes a quantum state and returns the completely mixed state. On the other hand, the channel $E$ has maximum Choi rank since the Choi representation of $E$ is just the identity function divided by the rank. This example is not unexpected since for each input of $R$ the possible outputs span the entire space $V$ evenly, so one does not have any information about the output from any particular input except that we know that the output could be anything. This example shows that the channels that locally minimize the loss function $L (R, E)$ are the channels that give us a sort of linear regression of $R$ but where this linear regression takes into consideration uncertainty in the output so the regression of a output of a state is a mixed state rather than a pure state.

Joseph Van Name Aug 12, 2024, 7:28 PM
3 points
0
on: Joseph Van Name’s Shortform
The notion of the linear regression is an interesting machine learning algorithm in the sense that it can be studied mathematically, but the notion of a linear regression is a quite limited machine learning algorithm as most relations are non-linear. In particular, the linear regression does not give us any notion of any uncertainty in the output.
One way to extend the notion of the linear regression to encapsulate uncertainty in the outputs is to regress a function not to a linear transformation mapping vectors to vectors, but to regress the function to a transformation that maps vectors to mixed states. And the notion of a quantum channel is an appropriate transformation that maps vectors to mixed states. One can optimize this quantum channel using gradient ascent.
For this post, I will only go through some basic facts about quantum information theory. The reader is referred to the book The Theory of Quantum Information by John Watrous for all the missing details.
Let $V$ be a complex Euclidean space. Let $L (V)$ denote the vector space of linear operators from $V$ to $V$ . Given complex Euclidean spaces $V, W$ , we say that a linear operator $E$ from $L (V)$ to $L (W)$ is a trace preserving if $Tr (E (X)) = Tr (X)$
for all $X$ , and we say that $E$ is completely positive if there are linear transformations $A_{1}, . . ., A_{r}$ where $E (X) = A_{1} X A_{1}^{*} + \dots + A_{r} X A_{r}^{*}$ for all $X$ ; the value $r$ is known as the Choi rank of $E$ . A completely positive trace preserving operator is known as a quantum channel.
The collection of quantum channels from $L (V)$ to $L (W)$ is compact and convex.
If $W$ is a complex Euclidean space, then let $D_{p} (W)$
denote the collection of pure states in $W$ . $D_{p} (W)$
can be defined either as the set of unit vector in $W$ modulo linear dependence, or $D_{p} (W)$
can be also defined as the collection of positive semidefinite rank- $1$ operators on $W$ with trace $1$ .
Given complex Euclidean spaces $U, V$ and a (multi) set of $r$ distinct ordered pairs of unit vectors $f = {(u_{1}, v_{1}), \dots, (u_{n}, v_{n})} \subseteq U \times V$ , and given a quantum channel
$E : L (U) \to L (V)$ , we define the fitness level $F (f, E) = \sum_{k = 1}^{r} log (E (u_{k} u_{k}^{*}) v_{k}, v_{k} ⟩$ and the loss level $L (f, E) = \sum_{k = 1}^{r} - log (E (u_{k} u_{k}^{*}) v_{k}, v_{k} ⟩$ .
We may locally optimize $E$ to minimize its loss level using gradient descent, but there is a slight problem. The set of quantum channels spans the $L (L (U), L (V))$ which has dimension $Dim (U)^{2} \cdot Dim (V)^{2}$ . Due to the large dimension, any locally optimal $E$ will contain $Dim (U)^{2} \cdot Dim (V)^{2}$ many parameters, and this is a large quantity of parameters for what is supposed to be just a glorified version of a linear regression. Fortunately, instead of taking all quantum channels into consideration, we can limit the scope the quantum channels of limited Choi rank.
Empirical Observation: Suppose that $U, V$ are complex Euclidean spaces, $f \subseteq U \times V$ is finite and $r$ is a positive integer. Then computer experiments indicate that there is typically only one quantum channel $E : L (U) \to L (V)$ of Choi rank at most $r$ where $L (f, E)$ is locally minimized. More formally, if we run the experiment twice and produce two quantum channels $E_{1}, E_{2}$ where $L (f, E_{j})$ is locally minimized for $j \in {1, 2}$ , then we would typically have $E_{1} = E_{2}$ . We therefore say that when $L (f, E)$ is minimized, $E$ is the best Choi rank $r$ quantum channel approximation to $f$ .
Suppose now that $f = {(u_{1}, v_{1}), \dots, (u_{n}, v_{n})} \subseteq D_{p} (U) \times D_{p} (V)$ is a multiset. Then we would ideally like to approximate the function $f$ better by alternating between the best Choi rank r quantum channel approximation and a non-linear mapping. An ideal choice of a non-linear but partial mapping is the function $DE$ that maps a positive semidefinite matrix $P$ to its (equivalence class of) unit dominant eigenvector.
Empirical observation: If $f = {(u_{1}, v_{1}), \dots, (u_{n}, v_{n})} \subseteq D_{p} (U) \times D_{p} (V)$ and $E$ is the best Choi rank $r$ quantum channel approximation to $f$ , then let $u_{j}^{♯} = DE (E (u_{j} u_{j}^{*}))$ for all $j$ , and define $f^{♯} = {(u_{1}^{♯}, v_{1}), \dots, (u_{n}^{♯}, v_{n})}$ . Let $U$ be a small open neighborhood of $f^{♯}$ . Let $g \in U$ . Then we typically have $g^{♯ ♯} = g^{♯}$ . More generally, the best Choi rank $r$ quantum channel approximation to $g$ is typically the identity function.
From the above observation, we see that the vector $u_{j}^{♯}$ is an approximation of $v_{j}$ that cannot be improved upon. The mapping $DE \circ E : D_{p} (U) \to D_{p} (V)$ is therefore a trainable approximation to the mapping $f$ and since $D_{p} (U), D_{p} (V)$ are not even linear spaces (these are complex projective spaces with non-trivial homology groups), the mapping $DE \circ E$ is a non-linear model for the function to $f$ .
I have been investigating machine learning models similar to $DE \circ E$ for cryptocurrency research and development as these sorts of machine learning models seem to be useful for evaluating the cryptographic security of some proof-of-work problems and other cryptographic functions like block ciphers and hash functions. I have seen other machine learning models that behave about as mathematically as $DE \circ E$ .
I admit that machine learning models like $DE \circ E$ are currently far from being as powerful as deep neural networks, but since $DE \circ E$ behaves mathematically, the model $DE \circ E$ should be considered as a safer and interpretable AI model. The goal is to therefore develop models that are mathematical like $DE \circ E$ but which can perform more and more machine learning tasks.
(Edited 8/14/2024)

Joseph Van Name Mar 3, 2024, 2:30 PM
1 point
0
on: Joseph Van Name’s Shortform
There are some cases where we have a complete description for the local optima for an optimization problem. This is a case of such an optimization problem.
Such optimization problems are useful for AI safety since a loss/fitness function where we have a complete description of all local or global optima is a highly interpretable loss/fitness function, and so one should consider using these loss/fitness functions to construct AI algorithms.
Theorem: Suppose that $U$ is a real,complex, or quaternionic $n \times n$ -matrix that minimizes the quantity $∥ U ∥_{2} + ∥ U^{- 1} ∥_{2}$ . Then $U$ is unitary.
Proof: The real case is a special case of a complex case, and by representing each $n \times n$ -quaternionic matrix as a complex $2 n \times 2 n$ -matrix, we may assume that $U$ is a complex matrix.
By the Schur decomposition, we know that $U = V T V^{*}$ where $V$ is a unitary matrix and $T$ is upper triangular. But we know that $∥ U ∥_{2} = ∥ T ∥_{2}$ . Furthermore, $U^{- 1} = V T^{- 1} V^{*}$ , so $∥ U^{- 1} ∥_{2} = ∥ T^{- 1} ∥_{2}$ . Let $D$ denote the diagonal matrix whose diagonal entries are the same as $T$ . Then $∥ T ∥_{2} \geq ∥ D ∥_{2}$ and $∥ T^{- 1} ∥_{2} \geq ∥ D^{- 1} ∥_{2}$ . Furthermore, $∥ T ∥_{2} = ∥ D ∥_{2}$ iff T is diagonal and $∥ T^{- 1} ∥_{2} = ∥ D^{- 1} ∥_{2}$ iff $D$ is diagonal. Therefore, since $∥ U ∥_{2} + ∥ U^{- 1} ∥_{2} = ∥ T ∥_{2} + ∥ T^{- 1} ∥_{2}$ and $∥ T ∥_{2} + ∥ T^{- 1} ∥_{2}$ is minimized, we can conclude that $T = D$ , so $T$ is a diagonal matrix. Suppose that $T$ has diagonal entries $(z_{1}, \dots, z_{n})$ . By the arithmetic-geometric mean equality and the Cauchy-Schwarz inequality, we know that $\frac{1}{2} \cdot (∥ (z_{1}, \dots, z_{n}) ∥_{2} + ∥ (z_{1}^{- 1}, \dots, z_{n}^{- 1}) ∥_{2}) \geq ∥ (| z_{1} |, \dots, | z_{n} |) ∥_{2} \cdot ∥ (| z_{1}^{- 1} |, \dots, | z_{n}^{- 1}) | ∥_{2}$
$\geq ⟨ (| z_{1} |, \dots, | z_{n} |), (| z_{1}^{- 1} |, \dots, | z_{n}^{- 1}) | ⟩ = \sqrt{n} .$
Here, the equalities hold if and only if $| z_{j} | = 1$ for all $j$ , but this implies that $U$ is unitary. Q.E.D.

Joseph Van Name Jan 29, 2024, 7:22 AM
12 points
0
in reply to: FireStormOOO’s comment on: Apologizing is a Core Rationalist Skill
I do not care to share much more of my reasoning because I have shared enough and also because there is a reason that I have vowed to no longer discuss except possibly with lots of obfuscation. This discussion that we are having is just convincing me more that the entities here are not the entities I want to have around me at all. It does not do much good to say that the community here is acting well or to question my judgment about this community. It will do good for the people here to act better so that I will naturally have a positive judgment about this community.

Joseph Van Name Jan 27, 2024, 1:47 AM
−2 points
0
in reply to: FireStormOOO’s comment on: Apologizing is a Core Rationalist Skill
You are judging my reasoning without knowing all that went into my reasoning. That is not good.

Joseph Van Name Jan 18, 2024, 8:07 PM
3 points
−1
in reply to: FireStormOOO’s comment on: Apologizing is a Core Rationalist Skill
I will work with whatever data I have, and I will make a value judgment based on the information that I have. The fact that Karma relies on very small amounts of information is a testament to a fault of Karma, and that is further evidence of how the people on this site do not want to deal with mathematics. And the information that I have indicates that there are many people here who are likely to fall for more scams like FTX. Not all of the people here are so bad, but I am making a judgment based on the general atmosphere here. If you do not like my judgment, then the best thing would be to try to do better. If this site has made a mediocre impression on me, then I am not at fault for the mediocrity here.

Joseph Van Name Jan 18, 2024, 6:40 PM
10 points
0
in reply to: FireStormOOO’s comment on: Apologizing is a Core Rationalist Skill
Let’s see whether the notions that I have talked about are sensible mathematical notions for machine learning.
Tensor product-Sometimes data in a neural network has tensor structure. In this case, the weight matrices should be tensor products or tensor sums. Regarding the structure of the data works well with convolutional neural networks, and it should also work well for data with tensor structure to it.
Trace-The trace of a matrix measures how much the matrix maps vectors onto themselves since
$Tr (A) = c \cdot E (⟨ A v, v ⟩)$ where $v$ follows the multivariate normal distribution.
Spectral radius-Suppose that we are iterating a smooth function $f$ . Suppose furthermore that $f (v) = v$ and $u_{0}$ is near $v$ and $u_{n + 1} = f (u_{n})$ . We would like to determine whether ${lim}_{n \to \infty} u_{n} = v$ or not. If the Jacobian of $f$ at $v$ has spectral radius less than $1$ , then ${lim}_{n \to \infty} u_{n} = v$ ,. If the Jacobian of $f$ at $v$ has spectral radius greater than $1$ , then this limit does not converge.
The notions that I have been talking about are sensible and arise in machine learning. And understanding these notions is far easier than trying to interpret very large networks like GPT-4 without using these notions. Many people on this site just act like clowns. Karma is only a good metric when the people on the network value substance over fluff. And the only way to convince me otherwise will be for the people here to value posts that involve basic notions like the trace, eigenvalues, and spectral radius of matrices.
P.S. I can make the trace, determinant, and spectral radius even simpler. These operations are what you get when you take the sum, product, and the maximum absolute value of the eigenvalues. Yes. Those are just the basic eigenvalue operations.

Joseph Van Name Jan 18, 2024, 12:03 PM
4 points
1
in reply to: FireStormOOO’s comment on: Apologizing is a Core Rationalist Skill
Talking about whining and my loss of status is a good way to get me to dislike the LW community and consider them to be anti-intellectuals who fall for garbage like FTX. Do you honestly think the people here should try to interpret large sections of LLMs while simultaneously being afraid of quaternions?
It is better to comment on threads where we are interacting in a more positive manner.
I thought apologizing and recognizing inadequacies was a core rationalist skill. And I thought rationalists were supposed to like mathematics. The lack of mathematical appreciation is one of these inadequacies of the LW community. But instead of acknowledging this deficiency, the community here blasts me as talking about something off topic. How ironic!

Joseph Van Name Jan 18, 2024, 11:46 AM
5 points
0
in reply to: Yitz’s comment on: An Introduction To The Mandelbrot Set That Doesn’t Mention Complex Numbers
I usually think of the field of complex numbers algebraically, but one can also think of the real numbers, complex numbers, and quaternions geometrically. The real numbers are good with dealing with 1 dimensional space, and the complex numbers are good for dealing with 2 dimensional space geometrically. While the division ring of quaternions is a 4 dimensional algebra over the field of real numbers, the quaternions are best used for dealing with 3 dimensional space geometrically.
For example, if $U, V$ are open subsets of some Euclidean space, then a function $f : U \to V$ is said to be a conformal mapping when it preserves angles and the orientation. We can associate the 2-dimensional Euclidean space with the field of complex numbers, and the conformal mappings between open subsets of 2-dimensional spaces are just the complex differentiable mappings. For the Mandelbrot set, we need this conformality because we want the Mandelbrot set to look pretty. If the complex differentiable maps were not conformal, then the functions that we iterate in complex dynamics would stretch subsets of the complex plane in one dimension and expand them in the other dimension and this would result in a fractal that looks quite stretched in one real dimension and squashed in another dimension (the fractals would look like spaghetti; oh wait, I just looked at a 3D fractal and it looks like some vegetable like broccoli). This stretching and squashing is illustrated by 3D fractals that try to mimic the Mandelbrot set but without any conformality. The conformality is why the Julia sets are sensible (mathematicians have proven theorems about these sets) for any complex polynomial of degree 2 or greater.
For the quaternions, it is well-known that the dot product and the cross product operations on 3 dimensional space can be described in terms of the quaternionic multiplication operation between purely imaginary quaternions.

Joseph Van Name Jan 12, 2024, 5:01 PM
10 points
on: Has anyone actually tried to convince Terry Tao or other top mathematicians to work on alignment?
Um. If you want to convince a mathematician like Terry Tao to be interested in AI alignment, you will need to present yourself as a reasonably competent mathematician or related expert and actually formulate an AI problem in such a way so that someone like Terry Tao would be interested in it. If you yourself are not interested in the problem, then Terry Tao will not be interested in it either.
Terry Tao is interested in random matrix theory (he wrote the book on it), and random matrix theory is somewhat related to my approach to AI interpretability and alignment. If you are going to send these problems to a mathematician, please inform me about this before you do so.
My approach to alignment: Given matrices $A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}$ , define a superoperator $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r})$ by setting
$Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}) (X) = \sum_{k = 1}^{r} A_{k} X B_{k}^{*}$ , and define $Φ (A_{1}, \dots, A_{r}) = Γ (A_{1}, \dots, A_{r}; A_{1}, \dots, A_{r})$ . Define the $L_{2}$ -spectral radius of $A_{1}, \dots, A_{r}$ as $ρ_{2} (A_{1}, \dots, A_{r}) = ρ (Φ (A_{1}, \dots, A_{r}))^{1 / 2}$ . Here, $ρ (A) = {lim}_{n \to \infty} ∥ A^{n} ∥^{1 / n}$ is the usual spectral radius.
Define $ρ_{2, d}^{K} (A_{1}, \dots, A_{r}) = max {\frac{ρ (Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}))}{ρ_{2} (X_{1}, \dots, X_{r})} : X_{1}, \dots, X_{r} \in M_{d} (K)}$ . Here, $K$ is either the field of reals, field of complex numbers, or division ring of quaternions.
Given matrices $A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}$ , define
$∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥ = \frac{Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r})}{ρ_{2} (A_{1}, \dots, A_{r}) ρ_{2} (B_{1}, \dots, B_{r})}$ . The value $∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥$ is always a real number in the interval $[0, 1]$ that is a measure of how jointly similar the tuples $(A_{1}, \dots, A_{r}), (B_{1}, \dots, B_{r})$ are. The motivation behind $ρ_{2, d}^{K} (A_{1}, \dots, A_{r})$ is that $\frac{ρ_{2, d}^{K} (A_{1}, \dots, A_{r})}{ρ_{2} (A_{1}, \dots, A_{r})}$ is always a real number in $[0, 1]$ (well except when the denominator is zero) that measures how well $A_{1}, \dots, A_{r}$ can be approximated by $d \times d$ -matrices. Informally, $\frac{ρ_{2, d}^{K} (A_{1}, \dots, A_{r})}{ρ_{2} (A_{1}, \dots, A_{r})}$ measures how random $A_{1}, \dots, A_{r}$ are where a lower value of $\frac{ρ_{2, d}^{K} (A_{1}, \dots, A_{r})}{ρ_{2} (A_{1}, \dots, A_{r})}$ indicates a lower degree of randomness.
A better theoretical understanding of $ρ_{2, d}^{K} (A_{1}, \dots, A_{r})$ would be great. If $X_{1}, \dots, X_{r} \in M_{d} (K)$ and $\frac{ρ (Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}))}{ϕ_{2} (X_{1}, \dots, X_{r})}$ is locally maximized, then we say that $(X_{1}, \dots, X_{r})$ is an LSRDR of $(A_{1}, \dots, A_{r})$ . Said differently, $(X_{1}, \dots, X_{r}) \in M_{d} (K)$ is an LSRDR of $(A_{1}, \dots, A_{r})$ if the similarity $∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥$ is maximized.
Here, the notion of an LSRDR is a machine learning notion that seems to be much more interpretable and much less subject to noise than many other machine learning notions. But a solid mathematical theory behind LSRDRs would help us understand not just what LSRDRs do, but the mathematical theory would help us understand why they do it.
Problems in random matrix theory concerning LSRDRs:
1. Suppose that $U_{1}, \dots, U_{r}$ are random matrices (according to some distribution). Then what are some bounds for $ρ_{2, d}^{K} (U_{1}, \dots, U_{r})$ .
2. Suppose that $U_{1}, \dots, U_{r}$ are random matrices and $A_{1}, \dots, A_{r}$ are non-random matrices. What can we say about the spectrum of $Γ (A_{1}, \dots, A_{r}; U_{1}, \dots, U_{r})$ ? My computer experiments indicate that this spectrum satisfies the circular law, and the radius of the disc for this circular law is proportional to $ρ_{2} (A_{1}, \dots, A_{r})$ , but a proof of this circular law would be nice.
3. Tensors can be naturally associated with collections of matrices. Suppose now that $U_{1}, \dots, U_{r}$ are the matrices associated with a random tensor. Then what are some bounds for $ρ_{2, d}^{K} (U_{1}, \dots, U_{r})$ .
P.S. By massively downvoting my posts where I talk about mathematics that is clearly applicable to AI interpretability and alignment, the people on this site are simply demonstrating that they need to do a lot of soul searching before they annoy people like Terry Tao with their lack of mathematical expertise.
P.P.S. Instead of trying to get a high profile mathematician like Terry Tao to be interested in problems, it may be better to search for a specific mathematician who is an expert in a specific area related to AI alignment since it may be easier to contact a lower profile mathematician, and a lower profile mathematician may have more specific things to say and contribute. You are lucky that Terry Tao is interested in random matrix theory, but this does not mean that Terry Tao is interested in anything in the intersection between alignment and random matrix theory. It may be better to search harder for mathematicians who are interested in your specific problems.
P.P.P.S. To get more mathematicians interested in alignment, it may be a good idea to develop AI systems that behave much more mathematically. Neural networks currently do not behave very mathematically since they look like the things that engineers would come up with instead of mathematicians.
P.P.P.P.S. I have developed the notion of an LSRDR for cryptocurrency research because I am using this to evaluate the cryptographic security of cryptographic functions.

Joseph Van Name Jan 10, 2024, 6:14 PM
3 points
0
on: Joseph Van Name’s Shortform
We can use the $L_{2} -$ spectral radius similarity to measure more complicated similarities between data sets.
Suppose that $A_{1}, \dots, A_{r}$ are $m \times m$ -real matrices and $B_{1}, \dots, B_{r}$ are $n \times n$ -real matrices. Let $ρ (A)$ denote the spectral radius of $A$ and let $A \otimes B$ denote the tensor product of $A$ with $B$ . Define the $L_{2}$ -spectral radius by setting $ρ_{2} (A_{1}, \dots, A_{r}) = ρ (A_{1} \otimes A_{1} + \dots + A_{r} \otimes A_{r})^{1 / 2}$ , Define the $L_{2}$ -spectral radius similarity between $A_{1}, \dots, A_{r}$ and $B_{1}, \dots, B_{r}$ as
$∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥_{2} = \frac{ρ (A_{1} \otimes B_{1} + \dots + A_{r} \otimes B_{r})}{ρ_{2} (A_{1}, \dots, A_{r}) ρ_{2} (B_{1}, \dots, B_{r})}$ .
We observe that if $C$ is invertible and $λ$ is a constant, then
$∥ (A_{1}, \dots, A_{r}) ≃ (λ C B_{1} C^{- 1}, \dots, λ C B_{r} C^{- 1}) ∥_{2} = 1.$
Therefore, the $L_{2}$ -spectral radius is able to detect and measure symmetry that is normally hidden.
Example: Suppose that $u_{1}, \dots, u_{r}; v_{1}, \dots, v_{r}$ are vectors of possibly different dimensions. Suppose that we would like to determine how close we are to obtaining an affine transformation $T$ with $T (u_{j}) = v_{j}$ for all $j$ (or a slightly different notion of similarity). We first of all should normalize these vectors to obtain vectors $x_{1}, \dots, x_{r}; y_{1}, \dots, y_{r}$ with mean zero and where the covariance matrix is the identity matrix (we may not need to do this depending on our notion of similarity). Then $∥ (x_{1} x_{1}^{*}, \dots, x_{r} x_{r}^{*}) ≃ (y_{1} y_{1}^{*}, \dots, y_{r} y_{r}^{*}) ∥_{2}$ is a measure of low close we are to obtaining such an affine transformation $T$ . We may be able to apply this notion to determining the distance between machine learning models. For example, suppose that $M, N$ are both the first few layers in a (typically different) neural network. Suppose that $a_{1}, \dots, a_{r}$ is a set of data points. Then if $u_{j} = M (a_{j})$ and $v_{j} = M (a_{j})$ , then $∥ (x_{1} x_{1}^{*}, \dots, x_{r} x_{r}^{*}) ≃ (y_{1} y_{1}^{*}, \dots, y_{r} y_{r}^{*}) ∥_{2}$ is a measure of the similarity between $M$ and $N$ .
I have actually used this example to see if there is any similarity between two different neural networks trained on the same data set. For my experiment, I chose a random collection of $S \subseteq {0, 1}^{32} \times {0, 1}^{32}$ of ordered pairs and I trained the neural networks $M, N$ to minimize the expected losses $E (∥ N (a) - b ∥^{2} : (a, b) \in S), E (∥ M (a) - b ∥^{2} : (a, b) \in S)$ . In my experiment, each $a_{j}$ was a random vector of length 32 whose entries were 0′s and 1′s. In my experiment, the similarity $∥ (x_{1} x_{1}^{*}, \dots, x_{r} x_{r}^{*}) ≃ (y_{1} y_{1}^{*}, \dots, y_{r} y_{r}^{*}) ∥_{2}$ was worse than if $x_{1}, \dots, x_{r}, y_{1}, \dots, y_{r}$ were just random vectors.
This simple experiment suggests that trained neural networks retain too much random or pseudorandom data and are way too messy in order for anyone to develop a good understanding or interpretation of these networks. In my personal opinion, neural networks should be avoided in favor of other AI systems, but we need to develop these alternative AI systems so that they eventually outperform neural networks. I have personally used the $L_{2}$ -spectral radius similarity to develop such non-messy AI systems including LSRDRs, but these non-neural non-messy AI systems currently do not perform as well as neural networks for most tasks. For example, I currently cannot train LSRDR-like structures to do any more NLP than just a word embedding, but I can train LSRDRs to do tasks that I have not seen neural networks perform (such as a tensor dimensionality reduction).

Joseph Van Name Jan 10, 2024, 2:35 PM
1 point
0
in reply to: RogerDearnaley’s comment on: Most People Don’t Realize We Have No Idea How Our AIs Work
I am curious about your statement that all large neural networks are isomorphic or nearly isomorphic and therefore have identical loss values. This should not be too hard to test.
Let $A, B, C$ be training data sets. Let $M, N$ be neural networks. First train $M$ on $A$ and $N$ on $B$ . Then slowly switch the training sets, so that we eventually train both $M$ and $N$ on just $C$ . After fully training $M$ and $N$ , one should be able to train an isomorphism between the networks $M$ and $N$ (here I assume that $M$ and $N$ are designed properly so that they can produce such an isomorphism) so that the value for each node in $M$ can be perfectly computed from each node in $N$ . Furthermore, for every possible input, the neural networks $M, N$ should give the exact same output. If this experiment does not work, then one should be able to set up another experiment that does actually work.
I have personally trained many ML systems for my cryptocurrency research where after training two systems on the exact same data but with independent random initializations, the fitness levels are only off by a floating point error of about $10^{- 13}$ , and I am able to find an exact isomorphism between these systems (and sometimes they are exactly the same and I do not need to find any isomorphism). But I have designed these ML systems to satisfy these properties along with other properties, and I have not seen this with neural networks. In fact, the property of attaining the exact same fitness level is a bit fragile.
I found a Bachelor’s thesis (people should read these occasionally; I apologize for selecting a thesis from Harvard) where someone tried to find an isomorphism between 1000 small trained machine learning models, and no such isomorphism was found.
https://dash.harvard.edu/bitstream/handle/1/37364688/SORENSEN-SENIORTHESIS-2020.pdf?sequence=1
Or maybe one can find a more complicated isomorphism between neural networks since a node permutation is quite oversimplistic.