This is a post about some of the machine learning algorithms that I have been experimenting with. These machine learning models behave quite mathematically, which seems to be very helpful for AI interpretability and AI safety.
Sequences of matrices generally cannot be approximated by sequences of Hermitian matrices.
Suppose that $A_1,\dots,A_r$ are $n\times n$-complex matrices and $B_1,\dots,B_r$ are $d\times d$-complex matrices. Then define a mapping $\Gamma(A_1,\dots,A_r;B_1,\dots,B_r):M_{n,d}(\mathbb{C})\rightarrow M_{n,d}(\mathbb{C})$ by $\Gamma(A_1,\dots,A_r;B_1,\dots,B_r)(X)=A_1XB_1^*+\dots+A_rXB_r^*$ for all $X$. Define $\Phi(A_1,\dots,A_r)=\Gamma(A_1,\dots,A_r;A_1,\dots,A_r)$. Define the $L_2$-spectral radius by setting $\rho_2(A_1,\dots,A_r)=\rho(\Phi(A_1,\dots,A_r))^{1/2}$, where $\rho$ denotes the spectral radius. Define the $L_{2,d}$-spectral radius similarity between $A_1,\dots,A_r$ and $B_1,\dots,B_r$ by
$$\frac{\rho(\Gamma(A_1,\dots,A_r;B_1,\dots,B_r))}{\rho_2(A_1,\dots,A_r)\cdot\rho_2(B_1,\dots,B_r)}.$$
The $L_{2,d}$-spectral radius similarity always lies in the interval $[0,1]$. If $n=d$, $A_1,\dots,A_r$ generates the algebra of $n\times n$-complex matrices, and $B_1,\dots,B_r$ also generates the algebra of $n\times n$-complex matrices, then the $L_{2,d}$-spectral radius similarity between $A_1,\dots,A_r$ and $B_1,\dots,B_r$ is exactly $1$ if and only if there are an invertible matrix $C$ and a non-zero constant $\lambda$ with $B_j=\lambda CA_jC^{-1}$ for all $j$.
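To make these definitions concrete, here is a minimal numerical sketch (this is not code from the post; the helper names gamma_matrix, spectral_radius, l2_spectral_radius, and l2d_similarity are my own) that computes the $L_{2,d}$-spectral radius similarity in NumPy by representing $\Gamma(A_1,\dots,A_r;B_1,\dots,B_r)$ as the matrix $\sum_j\overline{B_j}\otimes A_j$ acting on vectorized matrices.

```python
import numpy as np

def gamma_matrix(As, Bs):
    """Matrix of the linear map X -> A_1 X B_1^* + ... + A_r X B_r^* on vec(X)."""
    return sum(np.kron(B.conj(), A) for A, B in zip(As, Bs))

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def l2_spectral_radius(As):
    """rho_2(A_1,...,A_r) = rho(Gamma(A;A))^(1/2)."""
    return spectral_radius(gamma_matrix(As, As)) ** 0.5

def l2d_similarity(As, Bs):
    """L_{2,d}-spectral radius similarity; a number in [0, 1]."""
    return (spectral_radius(gamma_matrix(As, Bs))
            / (l2_spectral_radius(As) * l2_spectral_radius(Bs)))
```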
Now consider the supremum of
$$\frac{\rho(\Gamma(A_1,\dots,A_r;X_1,\dots,X_r))}{\rho_2(A_1,\dots,A_r)\cdot\rho_2(X_1,\dots,X_r)}$$
where $X_1,\dots,X_r$ range over the $d\times d$-Hermitian matrices.
One can get lower bounds for this supremum simply by locally maximizing the above quotient using gradient ascent, but if one locally maximizes this quantity twice, one typically gets the same fitness level.
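As a rough illustration of how such lower bounds could be obtained (this is my own sketch, not the setup used for the experiments, and it uses a generic derivative-free optimizer from SciPy in place of hand-written gradient ascent), one can parametrize $r$-tuples of $d\times d$-Hermitian matrices by real vectors and locally maximize the similarity computed by the l2d_similarity helper above:

```python
import numpy as np
from scipy.optimize import minimize

def params_to_hermitians(theta, r, d):
    """Unpack a real vector of length r*d*d into r Hermitian d x d matrices."""
    Ts = theta.reshape(r, d, d)
    return [(T + T.T) / 2 + 1j * (T - T.T) / 2 for T in Ts]

def hermitian_similarity_lower_bound(As, d, n_restarts=1, seed=0):
    """Locally maximize the similarity between As and tuples of d x d Hermitian matrices."""
    r, rng = len(As), np.random.default_rng(seed)
    best = -np.inf
    for _ in range(n_restarts):
        theta0 = rng.standard_normal(r * d * d)
        objective = lambda th: -l2d_similarity(As, params_to_hermitians(th, r, d))
        result = minimize(objective, theta0, method="Nelder-Mead",
                          options={"maxiter": 50000, "maxfev": 50000})
        best = max(best, -result.fun)
    return best
```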
Empirical observation/conjecture: If $A_1,\dots,A_r$ are $n\times n$-complex matrices, then the value of this supremum is the same whenever $d\geq n$.
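Under the assumptions of the sketches above (and with the caveat that a generic optimizer may stop short of a true local maximum), one crude way to probe this observation numerically is to compare the locally maximized similarity for $d=n$ against some $d>n$:

```python
# Hypothetical probe of the observation above; the parameters are arbitrary.
n, r = 4, 3
As = [np.random.randn(n, n) + 1j * np.random.randn(n, n) for _ in range(r)]
for d in (n, n + 2):
    # If the observation holds and the optimizer converges well, these values should agree.
    print(d, hermitian_similarity_lower_bound(As, d, n_restarts=3))
```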
The above observation means that sequences of $n\times n$-matrices are fundamentally non-Hermitian. In this case, we cannot get better models of $A_1,\dots,A_r$ using Hermitian matrices larger than the matrices $A_1,\dots,A_r$ themselves; I kind of want the behavior to be more complex instead of doing the same thing whenever $d\geq n$, but the purpose of modeling $A_1,\dots,A_r$ with Hermitian matrices is generally to use smaller matrices, not larger ones.
This means that the fitness function that we maximize here behaves mathematically.
Now, the model $(X_1,\dots,X_r)$ is a linear model of $(A_1,\dots,A_r)$ since the mapping $A_j\mapsto X_j$ is the restriction of a linear mapping, so such a linear model should be good for only a limited number of tasks, but the mathematical behavior of the model generalizes to multi-layered machine learning models.
In this post, we shall describe three related fitness functions with discrete domains for which the process of maximization is pseudodeterministic in the sense that if we locally maximize the fitness function multiple times, then we typically attain the same local maximum; this appears to be an important aspect of AI safety. These fitness functions are my own. While these functions are far from deep neural networks, I think they are still relevant to AI safety since they are closely related to other fitness functions that are locally maximized pseudodeterministically and that more closely resemble deep neural networks.
Let $K$ denote a finite dimensional algebra over the field of real numbers together with an adjoint operation $*$ (the operation $*$ is a linear involution with $(xy)^*=y^*x^*$). For example, $K$ could be the field of real numbers, the field of complex numbers, the division ring of quaternions, or a matrix ring over the reals, complex numbers, or quaternions. We can extend the adjoint $*$ to the matrix ring $M_r(K)$ by setting $(x_{i,j})_{i,j}^*=(x_{j,i}^*)_{i,j}$.
Let $n$ be a natural number. If $A_1,\dots,A_r\in M_n(K)$ and $X_1,\dots,X_r\in M_d(K)$, then define
$\Gamma(A_1,\dots,A_r;X_1,\dots,X_r):M_{n,d}(K)\rightarrow M_{n,d}(K)$ by setting $\Gamma(A_1,\dots,A_r;X_1,\dots,X_r)(X)=A_1XX_1^*+\dots+A_rXX_r^*$.
Suppose now that $1\leq d<n$. Then let $S_d\subseteq M_{n,n}(K)$ be the set of all $0,1$-diagonal matrices with exactly $d$ many $1$'s on the diagonal. We observe that each element of $S_d$ is an orthogonal projection. Define fitness functions $F_d,G_d,H_d:S_d\rightarrow\mathbb{R}$ by setting
$$F_d(P)=\rho(\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)),$$
$$G_d(P)=\rho(\Gamma(PA_1P,\dots,PA_rP;PA_1P,\dots,PA_rP)),\text{ and}$$
$$H_d(P)=\frac{F_d(P)^2}{G_d(P)}.$$
Here, $\rho$ denotes the spectral radius.
$F_d(P)$ is typically slightly larger than $G_d(P)$, so these three fitness functions are closely related.
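Here is a short sketch of the three fitness functions for the complex case $K=\mathbb{C}$ (my own helper names, reusing gamma_matrix and spectral_radius from the earlier sketch):

```python
def fitness_FGH(As, P):
    """F_d(P), G_d(P), H_d(P) for a 0,1-diagonal projection P with d ones."""
    PAs = [P @ A @ P for A in As]
    F = spectral_radius(gamma_matrix(As, PAs))   # F_d(P)
    G = spectral_radius(gamma_matrix(PAs, PAs))  # G_d(P)
    return F, G, F ** 2 / G                      # H_d(P) = F_d(P)^2 / G_d(P)
```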
If $P,Q\in S_d$, then we say that $Q$ is in the neighborhood of $P$ if $Q$ differs from $P$ in at most $2$ entries. If $F$ is a fitness function with domain $S_d$, then we say that $(P,F(P))$ is a local maximum of the function $F$ if $F(P)\geq F(Q)$ whenever $Q$ is in the neighborhood of $P$.
A path from initialization to a local maximum $(P_s,F(P_s))$ for $F$ will be a sequence $(P_0,\dots,P_s)$ where $P_0$ is generated uniformly at random, where $P_j$ is always in the neighborhood of $P_{j-1}$, and where $F(P_j)\geq F(P_{j-1})$ for all $j$; the length of such a path is $s$.
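A simple sketch of this local search for the complex case (my own function names; a projection in $S_d$ is represented by the set of indices of its $d$ ones, and a neighboring projection swaps one chosen index for an unchosen one, so it differs from the current one in at most $2$ diagonal entries):

```python
import random

def proj_from_indices(idx, n):
    """The 0,1-diagonal n x n projection with ones at the positions in idx."""
    P = np.zeros((n, n))
    P[list(idx), list(idx)] = 1.0
    return P

def climb(As, d, fitness, seed=None):
    """Greedy path from a uniformly random P_0 in S_d to a local maximum of `fitness`."""
    n = As[0].shape[0]
    idx = frozenset(random.Random(seed).sample(range(n), d))
    best = fitness(As, proj_from_indices(idx, n))
    while True:
        # All projections differing from the current one in at most 2 diagonal entries.
        neighbors = [(idx - {i}) | {j} for i in idx for j in range(n) if j not in idx]
        scored = [(fitness(As, proj_from_indices(c, n)), c) for c in neighbors]
        val, cand = max(scored, key=lambda t: t[0])
        if val <= best:          # no neighbor improves the fitness: local maximum reached
            return idx, best
        idx, best = cand, val
```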
Empirical observation: Suppose that $F\in\{F_d,G_d,H_d\}$. If we compute a path from initialization to a local maximum for $F$, then such a path will typically have length less than $n$. Furthermore, if we locally maximize $F$ multiple times, we will typically obtain the same local maximum each time. Moreover, if $P_F,P_G,P_H$ are the computed local maxima of $F_d,G_d,H_d$ respectively, then $P_F,P_G,P_H$ will either be identical or differ in relatively few diagonal entries.
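For example, with the sketches above (and arbitrary hypothetical parameters), one can run the local search twice and compare the computed local maxima:

```python
n, d, r = 12, 5, 3
As = [np.random.randn(n, n) + 1j * np.random.randn(n, n) for _ in range(r)]
F = lambda As_, P: fitness_FGH(As_, P)[0]   # the fitness function F_d
run1, _ = climb(As, d, F, seed=1)
run2, _ = climb(As, d, F, seed=2)
print(sorted(run1), sorted(run2))  # per the observation, typically the same set of indices
```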
I have not done the experiments yet, but one should be able to generalize the above empirical observation to matroids. Suppose that $M$ is the collection of bases of a matroid with underlying set $\{1,\dots,n\}$, so that $|A|=d$ for each $A\in M$; we identify each basis $A$ with the diagonal projection in $S_d$ whose $1$'s are at the positions in $A$. Then one should be able to make the same observation about the fitness functions $F_d|_M,G_d|_M,H_d|_M$ as well.
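A hypothetical sketch of such a restricted search (with $M$ supplied as a collection of $d$-element index sets and moves restricted to stay inside $M$, reusing the helpers above; this is only an illustration of the restriction, not something from the experiments):

```python
def climb_on_matroid(As, fitness, bases, seed=None):
    """Greedy local search restricted to the bases of a matroid.

    `bases` is a collection of d-element index sets (the bases of the matroid);
    moves are basis exchanges that stay inside `bases`.
    """
    n = As[0].shape[0]
    bases = [frozenset(A) for A in bases]
    idx = random.Random(seed).choice(bases)
    best = fitness(As, proj_from_indices(idx, n))
    while True:
        # Neighbors: bases differing from the current basis in at most 2 entries.
        neighbors = [B for B in bases if 0 < len(idx ^ B) <= 2]
        scored = [(fitness(As, proj_from_indices(B, n)), B) for B in neighbors]
        if not scored:
            return idx, best
        val, cand = max(scored, key=lambda t: t[0])
        if val <= best:
            return idx, best
        idx, best = cand, val
```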
We observe that the problems of maximizing $F_d,G_d,H_d$ are all NP-hard since the clique problem can be reduced to special cases of maximizing $F_d,G_d,H_d$. This means that maximizing $F_d,G_d,H_d$ can be a sophisticated problem, but it also means that we should not expect it to be easy to find the global maxima of $F_d,G_d,H_d$ in some cases.