Joseph Van Name

Karma: 76

Joseph Van Name Jan 4, 2024, 6:23 PM
1 point
−5
in reply to: FireStormOOO’s comment on: Apologizing is a Core Rationalist Skill
If you have any questions about the notation or definitions that I have used, you should ask about it in the mathematical posts that I have made and not here. Talking about it here is unhelpful, condescending, and it just shows that you did not even attempt to read my posts. That will not win you any favors with me or with anyone who cares about decency.
Karma is not only imperfect, but Karma has absolutely no relevance whatsoever because Karma can only be as good as the community here.
P.S. Asking a question about the notation does not even signify any lack of knowledge since a knowledgeable person may ask questions about the notation because the knowledgeable person thinks that the post should not assume that the reader has that background knowledge.
P.P.S. I got downvotes, so I got enough engagement on the mathematics. The problem is the community here thinks that we should solve problems with AI without using any math for some odd reason that I cannot figure out.

Joseph Van Name Jan 4, 2024, 3:31 PM
0 points
0
in reply to: Simon Fischer’s comment on: Apologizing is a Core Rationalist Skill
I am pointing out something wrong with the community here. The name of this site is LessWrong. On this site, it is better to acknowledge wrongdoing so that the people here do not fall into traps like FTX again. If you read the article, you would know that it is better to acknowledge wrongdoing or a community weakness than to double down.

Joseph Van Name Jan 4, 2024, 3:29 PM
1 point
0
in reply to: Algon’s comment on: Apologizing is a Core Rationalist Skill
I already did that. But it seems like the people here simply do not want to get into much mathematics regardless of how closely related to interpretability it is.
P.S. If anyone wants me to apply my techniques to GPT, I would much rather see the embedding spaces as more organized objects. I cannot deal very well with words that are represented as vectors of length 4096 very well. I would rather deal with words that are represented as 64 by 64 matrices (or with some other dimensions). If we want better interpretability, the data needs to be structured in a more organized fashion so that it is easier to apply interpretability tools to the data.

Joseph Van Name Jan 3, 2024, 12:20 PM
0 points
0
on: Apologizing is a Core Rationalist Skill
“Lesswrong has a convenient numerical proxy-metric of social status: site karma.”-As long as I get massive downvotes for talking correctly about mathematics and using it to create interpretable AI systems, we should all regard karma as a joke. Karma can only be as good as the community here.

Joseph Van Name Dec 18, 2023, 11:34 PM
1 point
0
on: Joseph Van Name’s Shortform
Let’s compute some inner products and gradients.
Set up: Let $K$ denote either the field of real or the field of complex numbers. Suppose that $d_{1}, \dots, d_{r}$ are positive integers. Let $m_{0}, \dots, m_{n}$ be a sequence of positive integers with $m_{0} = m_{n} = 1$ . Suppose that $X_{i, j}$ is an $m_{i - 1} \times m_{i}$ -matrix whenever $1 \leq j \leq d_{i}$ . Then from the matrices $X_{i, j}$ , we can define a $d_{1} \times \dots \times d_{r}$ -tensor $T ((X_{i, j})_{i, j}) = (X_{1, i_{1}} \dots X_{n, i_{n}})_{i_{1}, \dots, i_{n}}$ . I have been doing computer experiments where I use this tensor to approximate other tensors by minimizing the $ℓ_{2}$ -distance. I have not seen this tensor approximation algorithm elsewhere, but perhaps someone else has produced this tensor approximation construction before. In previous shortform posts on this site, I have given evidence that the tensor dimensionality reduction behaves well, and in this post, we will focus on ways to compute with the tensors $T ((X_{i, j})_{i, j})$ , namely the inner product of such tensors and the gradient of the inner product with respect to the matrices $(X_{i, j})_{i, j}$ .
Notation: If $A_{1}, \dots, A_{r}, B_{1}, \dots, B_{r}$ are matrices, then let $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r})$ denote the superoperator defined by letting $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}) (X) = A_{1} X B_{1}^{*} + \dots + A_{r} X B_{r}^{*}$ . Let $Φ (A_{1}, \dots, A_{r}) = Γ (A_{1}, \dots, A_{r}; A_{1}, \dots, A_{r})$ .
Inner product: Here is the computation of the inner product of our tensors.
$⟨ T ((A_{i, j})_{i, j}), T ((B_{i, j})_{i, j}) ⟩$
$= ⟨ (A_{1, i_{1}} \dots A_{n, i_{n}})_{i_{1}, \dots, i_{n}}, (B_{1, i_{1}} \dots B_{n, i_{n}})_{i_{1}, \dots, i_{n}} ⟩$
$= \sum_{i_{1}, \dots, i_{n}} A_{1, i_{1}} \dots A_{n, i_{n}} (B_{1, i_{1}} \dots B_{n, i_{n}})^{*}$
$= \sum_{i_{1}, \dots, i_{n}} A_{1, i_{1}} \dots A_{n, i_{n}} B_{n, i_{n}}^{*} \dots B_{1, i_{1}}^{*}$
$= Γ (A_{1, 1}, \dots, A_{1, d_{1}}; B_{1, 1}, \dots, B_{1, d_{1}}) \dots Γ (A_{n, 1}, \dots, A_{n, d_{n}}; B_{n, 1}, \dots, B_{n, d_{n}})$ .
In particular, $∥ T ((A_{i, j})_{i, j}) ∥^{2} = Φ (A_{1, 1}, \dots, A_{1, d_{1}}) \dots Φ (A_{n, 1}, \dots, A_{n, d_{n}})$ .
Gradient: Observe that $\nabla_{X} Tr (A X) = A^{T}$ . We will see shortly that the cyclicity of the trace is useful for calculating the gradient. And here is my manual calculation of the gradient of the inner product of our tensors.
$\nabla_{X_{α, β}} ⟨ T ((X_{i, j})_{i, j}), T ((A_{i, j})_{i, j}) ⟩$
$= \nabla_{X_{α, β}} \sum_{i_{1}, \dots, i_{n}} X_{1, i_{1}} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots A_{1, i_{1}}^{*}$
$= \nabla_{X_{α, β}} Tr (\sum_{i_{1}, \dots, i_{n}} X_{1, i_{1}} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots A_{1, i_{1}}^{*})$
$= \nabla_{X_{α, β}} Tr (\sum_{i_{1}, \dots, i_{n}} X_{α, i_{α}} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots$
$A_{α + 1, i_{α + 1}}^{*} A_{α, i_{α}}^{*} A_{α - 1, i_{α - 1}}^{*} \dots A_{1, i_{1}}^{*} X_{1, i_{1}} \dots X_{α - 1, i_{α - 1}})$
$= \nabla_{X_{α, β}} Tr (\sum_{i_{α + 1}, \dots, i_{n}, i_{1}, \dots, i_{α - 1}} X_{α, β} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots$
$A_{α + 1, i_{α + 1}}^{*} A_{α, β}^{*} A_{α - 1, i_{α - 1}}^{*} \dots A_{1, i_{1}}^{*} X_{1, i_{1}} \dots X_{α - 1, i_{α - 1}})$
$= (\sum_{i_{α + 1}, \dots, i_{n}, i_{1}, \dots, i_{α - 1}} X_{α + 1, i_{α + 1}} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots$
$A_{α + 1, i_{α + 1}}^{*} A_{α, β}^{*} A_{α - 1, i_{α - 1}}^{*} \dots A_{1, i_{1}}^{*} X_{1, i_{1}} \dots X_{α - 1, i_{α - 1}})^{T}$
$= (\sum_{i_{α + 1}, \dots, i_{n}, i_{1}, \dots, i_{α - 1}} X_{α + 1, i_{α + 1}} \dots X_{n, i_{n}}$
$A_{n, i_{n}}^{*} \dots A_{α + 1, i_{α + 1}}^{*} A_{α, β}^{*} A_{α - 1, i_{α - 1}}^{*} \dots A_{1, i_{1}}^{*} X_{1, i_{1}} \dots X_{α - 1, i_{α - 1}})^{T}$
$= [(Γ (X_{α + 1, 1}, \dots, X_{α + 1, d_{α + 1}}; A_{α + 1, 1}, \dots, A_{α + 1, d_{α + 1}}) \dots$
$Γ (X_{n, 1}, \dots, X_{n, d_{n}}; A_{n, 1}, \dots, A_{n, d_{n}}) (1))$
$A_{α, β}^{*}$
$((Γ (A_{α - 1, 1}^{*}, \dots, A_{α - 1, d_{α - 1}}^{*}; X_{α - 1, 1}^{*}, \dots, X_{α - 1, d_{α - 1}}^{*}) \dots$
$Γ (A_{1, 1}^{*}, \dots, A_{1, d_{1}}^{*}; X_{1, 1}^{*}, \dots, X_{1, d_{1}}^{*}) (1))]^{T}$ .

Joseph Van Name Dec 17, 2023, 9:45 PM
1 point
0
on: Joseph Van Name’s Shortform
So in my research into machine learning algorithms, I have stumbled upon a dimensionality reduction algorithm for tensors, and my computer experiments have so far yielded interesting results. I am not sure that this dimensionality reduction is new, but I plan on generalizing this dimensionality reduction to more complicated constructions that I am pretty sure are new and am confident would work well.
Suppose that $K$ is either the field of real numbers or the field of complex numbers. Suppose that $d_{1}, \dots, d_{n}$ are positive integers and $(m_{0}, \dots, m_{n})$ is a sequence of positive integers with $m_{0} = m_{n} = 1$ . Suppose that $X_{i, j}$ is an $m_{i - 1} \times m_{i}$ -matrix whenever $1 \leq j \leq d_{i}$ . Then define a tensor $T ((X_{i, j})) = (X_{1, i_{1}} \dots X_{n, i_{n}})_{i_{1}, \dots, i_{n}} \in K^{d_{1}} \otimes \dots \otimes K^{d_{n}}$ .
If $v \in K^{d_{1}} \otimes \dots \otimes K^{d_{n}}$ , and $(X_{i, j})_{i, j}$ is a system of matrices that minimizes the value $∥ v - T ((X_{i, j})) ∥$ , then $T ((X_{i, j})_{i, j})$ is a dimensionality reduction of $(X_{i, j})_{i, j}$ , and we shall denote let $u$ denote the tensor of reduced dimension $T ((X_{i, j})_{i, j})$ . We shall call $u$ a matrix table to tensor dimensionality reduction of type $(m_{0}, \dots, m_{n})$ .
Observation 1: (Sparsity) If $v$ is sparse in the sense that most entries in the tensor $v$ are zero, then the tensor $u$ will tend to have plenty of zero entries, but as expected, $u$ will be less sparse than $v$ .
Observation 2: (Repeated entries) If $v$ is sparse and $v = (x_{i_{1}, \dots, i_{n}})_{i_{1}, \dots, i_{n}}$ and the set ${x_{i_{1}, \dots, i_{n}} : i_{1}, \dots, i_{n}}$ has small cardinality, then the tensor $u$ will contain plenty of repeated non-zero entries.
Observation 3: (Tensor decomposition) Let $v$ be a tensor. Then we can often find a matrix table to tensor dimensionality reduction $u$ of type $(m_{0}, \dots, m_{n})$ so that $v - u$ is its own matrix table to tensor dimensionality reduction.
Observation 4: (Rational reduction) Suppose that $v$ is sparse and the entries in $v$ are all integers. Then the value $∥ u - v ∥^{2}$ is often a positive integer in both the case when $u$ has only integer entries and in the case when $u$ has non-integer entries.
Observation 5: (Multiple lines) Let $m$ be a fixed positive even number. Suppose that $v$ is sparse and the entries in $v$ are all of the form $r \cdot e^{2 π i n / m}$ for some integer $n$ and $r \geq 0$ . Then the entries in $u$ are often exclusively of the form $r \cdot e^{2 π i n / m}$ as well.
Observation 6: (Rational reductions) I have observed a sparse tensor $v$ all of whose entries are integers along with matrix table to tensor dimensionality reductions $u_{1}, u_{2}$ of $v$ where $∥ v - u_{1} ∥ = 3, ∥ v - u_{1} ∥ = 2, ∥ u_{2} - u_{1} ∥ = 5$ .
This is not an exclusive list of all the observations that I have made about the matrix table to tensor dimensionality reduction.
From these observations, one should conclude that the matrix table to tensor dimensionality reduction is a well-behaved machine learning algorithm. I hope and expect this machine learning algorithm and many similar ones to be used to both interpret the AI models that we have and will have and also to construct more interpretable and safer AI models in the future.

Joseph Van Name Dec 14, 2023, 11:23 PM
2 points
0
on: Joseph Van Name’s Shortform
So in my research into machine learning algorithms that I can use to evaluate small block ciphers for cryptocurrency technologies, I have just stumbled upon a dimensionality reduction for tensors in tensor products of inner product spaces that according to my computer experiments exists, is unique, and which reduces a real tensor to another real tensor even when the underlying field is the field of complex numbers. I would not be too surprised if someone else came up with this tensor dimensionality reduction before since it has a rather simple description and it is in a sense a canonical tensor dimensionality reduction when we consider tensors as homogeneous non-commutative polynomials. But even if this tensor dimensionality reduction is not new, this dimensionality reduction algorithm belongs to a broader class of new algorithms that I have been studying recently such as LSRDRs.
Suppose that $K$ is either the field of real numbers or the field of complex numbers. Let $V_{1}, \dots, V_{n}$ be finite dimensional inner product spaces over $K$ with dimensions $d_{1}, \dots, d_{n}$ respectively. Suppose that $V_{i}$ has basis $e_{i, 1}, \dots, e_{i, d_{i}}$ . Given $v \in V_{1} \otimes \dots \otimes V_{n}$ , we would sometimes want to approximate the tensor $v$ with a tensor that has less parameters. Suppose that $(m_{0}, \dots, m_{n})$ is a sequence of natural numbers with $m_{0} = m_{n} = 1$ . Suppose that $X_{i, j}$ is a $m_{i - 1} \times m_{i}$ matrix over the field $K$ for $1 \leq i \leq n$ and $1 \leq j \leq d_{i}$ . From the system of matrices $(X_{i, j})_{i, j}$ , we obtain a tensor $T ((X_{i, j})_{i, j}) = \sum_{i_{1}, \dots, i_{n}} e_{i_{1}} \otimes \dots \otimes e_{i_{n}} \cdot X_{1, i_{1}} \dots X_{n, i_{n}}$ . If the system of matrices $(X_{i, j})_{i, j}$ locally minimizes the distance $∥ v - T ((X_{i, j})_{i, j}) ∥$ , then the tensor $T ((X_{i, j})_{i, j})$ is a dimensionality reduction of $v$ which we shall denote by $u$ .
Intuition: One can associate the tensor product $V_{1} \otimes \dots \otimes V_{n}$ with the set of all degree $n$ homogeneous non-commutative polynomials that consist of linear combinations of the monomials of the form $x_{1, i_{1}} \dots x_{n, i_{n}}$ . Given, our matrices $X_{i, j}$ , we can define a linear functional $ϕ : V_{1} \otimes \dots \otimes V_{n} \to K$ by setting $ϕ (p) = p ((X_{i, j})_{i, j})$ . But by the Reisz representation theorem, the linear functional $ϕ$ is dual to some tensor in $V_{1} \otimes \dots \otimes V_{n}$ . More specifically, $ϕ$ is dual to $T ((X_{i, j})_{i, j})$ . The tensors of the form $T ((X_{i, j})_{i, j})$ are therefore the
Advantages:
1. In my computer experiments, the reduced dimension tensor $u$ is often (but not always) unique in the sense that if we calculate the tensor $u$ twice, then we will get the same tensor. At least, the distribution of reduced dimension tensors $u$ will have low Renyi entropy. I personally consider the partial uniqueness of the reduced dimension tensor to be advantageous over total uniqueness since this partial uniqueness signals whether one should use this tensor dimensionality reduction in the first place. If the reduced tensor is far from being unique, then one should not use this tensor dimensionality reduction algorithm. If the reduced tensor is unique or at least has low Renyi entropy, then this dimensionality reduction works well for the tensor $v$ .
2. This dimensionality reduction does not depend on the choice of orthonormal basis $e_{i, 1}, \dots, e_{i, d_{i}}$ . If we chose a different basis for each $V_{i}$ , then the resulting tensor $u$ of reduced dimensionality will remain the same (the proof is given below).
3. If $K$ is the field of complex numbers, but all the entries in the tensor $v$ happen to be real numbers, then all the entries in the tensor $u$ will also be real numbers.
4. This dimensionality reduction algorithm is intuitive when tensors are considered as homogeneous non-commutative polynomials.
Disadvantages:
1. This dimensionality reduction depends on a canonical cyclic ordering the inner product spaces $V_{1}, \dots, V_{n}$ .
2. Other notions of dimensionality reduction for tensors such as the CP tensor dimensionality reduction and the Tucker decompositions are more well-established, and they are obviously attempted generalizations of the singular value decomposition to higher dimensions, so they may be more intuitive to some.
3. The tensors of reduced dimensionality $T ((X_{i, j})_{i, j})$ have a more complicated description than the tensors in the CP tensor dimensionality reduction.
Proposition: The set of tensors of the form $\sum_{i_{1}, \dots, i_{n}} e_{1, i_{1}} \otimes \dots \otimes e_{n, i_{n}} X_{1, i_{1}} \dots X_{n, i_{n}}$ does not depend on the choice of bases $(e_{i, 1}, \dots, e_{i, d_{i}})_{i}$ .
Proof: For each $i$ , let $f_{i, 1}, \dots, f_{i, d_{i}}$ be an alternative basis for $V_{i}$ . Then suppose that $e_{i, j} = \sum_{k} u_{i, j, k} f_{i, k}$ for each $i, j$ . Then
$\sum_{i_{1}, \dots, i_{n}} e_{1, i_{1}} \otimes \dots \otimes e_{n, i_{n}} X_{1, i_{1}} \dots X_{n, i_{n}}$
$= \sum_{i_{1}, \dots, i_{n}} \sum_{k_{1}} u_{1, i_{1}, k_{1}} f_{1, i_{1}} \otimes \dots \otimes \sum_{k_{n}} u_{n, i_{n}, k_{n}} f_{n, i_{n}} X_{1, i_{1}} \dots X_{n, i_{n}}$
$= \sum_{k_{1}, \dots, k_{n}} f_{1, k_{1}} \otimes \dots \otimes f_{n, k_{n}} \sum_{i_{1}, \dots, i_{n}} u_{1, i_{1}, k_{1}} \dots u_{n, i_{n}, k_{n}} X_{1, i_{1}} \dots X_{n, i, n}$
$= \sum_{k_{1}, \dots, k_{n}} f_{1, k_{1}} \otimes \dots \otimes f_{n, k_{n}} (\sum_{i_{1}} u_{1, i_{1}, k_{1}} X_{1, i_{1}}) \dots (\sum_{i_{n}} u_{n, i_{n}, k_{n}} X_{i_{n}})$ . Q.E.D.
A failed generalization: An astute reader may have observed that if we drop the requirement that $m_{n} = 1$ , then we get a linear functional defined by letting
$ϕ (p) = Tr (p ((X_{i, j})_{i, j}))$ . This is indeed a linear functional, and we can try to approximate $v$ using a the dual to $ϕ$ , but this approach does not work as well.

Joseph Van Name Dec 12, 2023, 10:49 PM
3 points
0
in reply to: Algon’s comment on: Joseph Van Name’s Shortform
Thanks for pointing that out. I have corrected the typo. I simply used the symbol $r$ for two different quantities, but now the probability is denoted by the symbol $α$ .

Joseph Van Name Dec 11, 2023, 10:36 PM
10 points
0
on: Joseph Van Name’s Shortform
Every entry in a matrix counts for the $L_{2}$ -spectral radius similarity. Suppose that $A_{1}, \dots, A_{r}, B_{1}, \dots, B_{r}$ are real $n \times n$ -matrices. Set $A^{\otimes 2} = A \otimes A$ . Define the $L_{2}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(B_{1}, \dots, B_{r})$ to be the number
$\frac{ρ (A_{1} \otimes B_{1} + \dots + A_{r} \otimes B_{r})}{ρ (A_{1}^{\otimes 2} + \dots + A_{r}^{\otimes 2})^{1 / 2} ρ (B_{1}^{\otimes 2} + \dots + B_{r}^{\otimes 2})^{1 / 2}}$ . Then the $L_{2}$ -spectral radius similarity is always a real number in the interval $[0, 1]$ , so one can think of the $L_{2}$ -spectral radius similarity as a generalization of the value $\frac{| ⟨ u, v ⟩ |}{∥ u ∥ \cdot ∥ v ∥}$ where $u, v$ are real or complex vectors. It turns out experimentally that if $A_{1}, \dots, A_{r}$ are random real matrices, and each $B_{j}$ is obtained from $A_{j}$ by replacing each entry in $B_{j}$ with $0$ with probability $1 - α$ , then the $L_{2}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(B_{1}, \dots, B_{r})$ will be about $\sqrt{α}$ . If $u = (A_{1}, \dots, A_{r}), v = (B_{1}, \dots, B_{r})$ , then observe that $\frac{| ⟨ u, v ⟩ |}{∥ u ∥ \cdot ∥ v ∥} \approx \sqrt{α}$ as well.
Suppose now that $A_{1}, \dots, A_{r}$ are random real $n \times n$ matrices and $C_{1}, \dots, C_{r}$ are the $m \times m$ submatrices of $A_{1}, \dots, A_{r}$ respectively obtained by only looking at the first $m$ rows and columns of $A_{1}, \dots, A_{r}$ . Then the $L_{2}$ -spectral radius similarity between $A_{1}, \dots, A_{r}$ and $C_{1}, \dots, C_{r}$ will be about $\sqrt{m / n}$ . We can therefore conclude that in some sense $C_{1}, \dots, C_{r}$ is a simplified version of $A_{1}, \dots, A_{r}$ that more efficiently captures the behavior of $A_{1}, \dots, A_{r}$ than $B_{1}, \dots, B_{r}$ does.
If $A_{1}, \dots, A_{r}, B_{1}, \dots, B_{r}$ are independent random matrices with standard Gaussian entries, then the $L_{2}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(B_{1}, \dots, B_{r})$ will be about $1 / \sqrt{r}$ with small variance. If $u, v$ are random Gaussian vectors of length $r$ , then $\frac{| ⟨ u, v ⟩ |}{∥ u ∥ \cdot ∥ v ∥}$ will on average be about $c / \sqrt{r}$ for some constant $c$ , but $\frac{| ⟨ u, v ⟩ |}{∥ u ∥ \cdot ∥ v ∥}$ will have a high variance.
These are some simple observations that I have made about the spectral radius during my research for evaluating cryptographic functions for cryptocurrency technologies.

Joseph Van Name Dec 7, 2023, 11:32 AM
1 point
0
on: Deep Forgetting & Unlearning for Safely-Scoped LLMs
The problem of unlearning would be solved (or kind of solved) if we just used machine learning models that optimize fitness functions that always converged to the same local optimum regardless of the initial conditions (pseudodeterministic training) or at least has very few local optima. But this means that we will have to use something other than neural networks for this and instead use something that behaves much more mathematically. Here the difficulty is to construct pseudodeterministically trained machine learning models that can perform fancy tasks about as efficiently as neural networks. And, hopefully we will not have any issues with a partially retrained pseudodeterministically trained ML model remembering just enough of the bad thing to do bad stuff.

Joseph Van Name Nov 11, 2023, 12:38 PM
−12 points
−3
on: Who is Sam Bankman-Fried (SBF) really, and how could he have done what he did? - three theories and a lot of evidence
The cryptocurrency sector is completely and totally unable to see any merit in using cryptocurrency mining to solve a scientific problem regardless of the importance of the scientific problem or the complete lack of drawbacks from using such a scientific problem as their mining algorithm. Yes. They would rather just see Bitcoin mining waste as much resources as possible than put those resources to good use. Since the cryptocurrency sector lacks the characteristics that should be desirable, FTX should not have surprised anyone.

Joseph Van Name Oct 30, 2023, 11:34 PM
3 points
0
in reply to: Joseph Van Name’s comment on: Joseph Van Name’s Shortform
I think that all that happened here was the matrices $A_{1}, \dots, A_{r}$ just ended up being diagonal matrices. This means that this is probably an uninteresting observation in this case, but I need to do more tests before commenting any further.

Joseph Van Name Oct 29, 2023, 11:33 PM
1 point
0
on: Joseph Van Name’s Shortform
Suppose that $q, r, d$ are natural numbers. Let $1 < p < \infty$ . Let $z_{i, j}$ be a complex number whenever $1 \leq i \leq q, 1 \leq j \leq r$ . Let $L : M_{d} (C)^{r} ∖ {0} \to [- \infty, \infty)$ be the fitness function defined by letting $L (X_{1}, \dots, X_{r})$ $= (\sum_{i = 1}^{q} log (ρ (\sum_{j = 1}^{r} z_{i, j} X_{j})) / q) - log (∥ \sum_{j = 1}^{r} X_{j} X_{j}^{*} ∥_{p}) / 2$ . Here, $ρ (X)$ denotes the spectral radius of a matrix $X$ while $∥ X ∥_{p}$ denotes the Schatten $p$ -norm of $X$ .
Now suppose that $(A_{1}, \dots, A_{r}) \in M_{d} (C)^{r} ∖ {0}$ is a tuple that maximizes $L (A_{1}, \dots, A_{r})$ . Let $M : C^{r} ∖ {0} \to [- \infty, \infty)$ be the fitness function defined by letting $M (w_{1}, \dots, w_{r}) = log (ρ (w_{1} A_{1} + \dots + w_{r} A_{r})) - log (∥ (w_{1}, \dots, w_{r}) ∥_{2})$ . Then suppose that $(v_{1}, \dots, v_{r}) \in C^{r} ∖ {0}$ is a tuple that maximizes $M (v_{1}, \dots, v_{r})$ . Then we will likely be able to find an $ℓ \in {1, \dots, q}$ and a non-zero complex number $α$ where $(v_{1}, \dots, v_{r}) = α \cdot (x_{ℓ, 1}, \dots, x_{ℓ, r})$ .
In this case, $(z_{i, j})_{i, j}$ represents the training data while the matrices $A_{1}, \dots, A_{r}$ is our learned machine learning model. In this case, we are able to recover some original data values from the learned machine learning model $A_{1}, \dots, A_{r}$ without any distortion to the data values.
I have just made this observation, so I am still exploring the implications of this observation. But this is an example of how mathematical spectral machine learning algorithms can behave, and more mathematical machine learning models are more likely to be interpretable and they are more likely to have a robust mathematical/empirical theory behind them.

Joseph Van Name’s Shortform

Joseph Van NameOct 29, 2023, 11:33 PM

2 points

17 comments1 min readLW link

Joseph Van Name Oct 22, 2023, 5:13 PM
5 points
0
on: [Cross-post]The theoretical computational limit of the Solar System is 1.47x10^49 bits per second.
I forgot to mention another source of difficulty in getting the energy efficiency of the computation down to Landauer’s limit at the CMB temperature.
Recall that the Stefan Boltzmann equation states that the power being emitted from an object by thermal radiation is equal to $P = A \cdot ϵ \cdot σ \cdot T^{4}$ . Here, $P$ stands for power, $A$ is the surface area of the object, $ϵ$ is the emissivity of the object ( $ϵ$ is a real number with $0 \leq ϵ \leq 1$ ), $T$ is the temperature, and $σ$ is the Stefan-Boltzmann constant. Here, $σ \approx 5.67 \cdot 10^{- 8} \cdot W \cdot K^{- 4} \cdot m^{- 2}$ .
Suppose therefore that we want a Dyson sphere with radius $r$ that maintains a temperature of 4 K which is slightly above the CMB temperature. To simplify the calculations, I am going to ignore the energy that the Dyson sphere receives from the CMB so that I obtain a lower bound for the size of our Dyson sphere. Let us assume that our Dyson sphere is a perfect emitter of thermal radiation so that $ϵ = 1$ .
Earth’s surface has a temperature of about $300 K$ . In order to have a temperature of $4 K$ , our Dyson sphere needs to receive $(1 / 75)^{4}$ the energy per unit of area. This means that the Dyson sphere needs to have a radius of about $(75^{4})^{1 / 2} = 75^{2} = 5625$ astronomical units (recall that the distance from Earth to the sun is 1 astronomical unit).
Let us do more precise calculations to get a more exact radius of our Dyson sphere.
$r^{2} = \frac{P}{4 π σ (4 K)^{4}}$ , so $r = \sqrt{\frac{P}{4 π σ (4 K)^{4}}} = \frac{1}{2 (4 K)^{2}} \cdot \sqrt{\frac{P}{π σ}}$ $\approx 1.45 \cdot 10^{15} m$ which is about 15 percent of a light-year. Since the nearest star is 4 light years away, by the time that we are able to construct a Dyson sphere with a radius that is about 15 percent of a light year, I think that we will be able to harness energy from other stars such as Alpha Centauri.
The fourth power in the Stefan Boltzmann equation makes it hard for cold objects to radiate heat.

Joseph Van Name Oct 17, 2023, 11:16 PM
2 points
0
on: [Cross-post]The theoretical computational limit of the Solar System is 1.47x10^49 bits per second.
This post uses the highly questionable assumption that we will be able to produce a Dyson sphere that can maintain a temperature at the level of the cosmic microwave background before we will be able to use energy efficient reversible computation to perform operations that cost much less than $k \cdot T$ energy. And this post also makes the assumption that we will achieve computation at the level of about $k \cdot T \cdot ln (2)$ per bit deletion before we will be able to achieve reversible computation. And it gets difficult to overcome thermal noise at an energy level well above $k \cdot T \cdot ln (2)$ regardless of the type of hardware that one uses. At best, this post is an approximation for the computational power of a Dyson sphere that may be off by some orders of magnitude.

Joseph Van Name Oct 16, 2023, 7:01 PM
1 point
−1
in reply to: Leo P.’s comment on: Hyperreals in a Nutshell
Let $X,Y$ be topological spaces. Then a function $f:X\rightarrow Y$ is continuous if and only if whenever $(x_d)_{d\in D}$ is a net that converges to the point $x$, the net $(f(x_d))_{d\in D}$ also converges to the point $f(x)$. This is not very hard to prove. This means that we do not have to discuss as to whether continuity should be defined in terms of open sets instead of limits because both notions apply to all topological spaces. If anything, one should define continuity in terms of closed sets instead of open sets since closed generalize slightly better to objects known as closure systems (which are like topological spaces, but we do not require the union of two closed sets to be closed). For example, the collection of all subgroups of a group is a closure system, but the complements of the subgroups of a group have little importance, so if we want the definition that makes sense in the most general context, closed sets behave better than open sets. And as a bonus, the definition of continuity works well when we are taking the inverse image of closed sets and when we are taking the closure of the image of a set.
With that being said, the good thing about continuity is that it has enough characterizations so that at least one of these characterizations is satisfying (and general topology texts should give all of these characterizations even in the context of closure systems so that the reader can obtain such satisfaction with the characterization of his or her choosing).

Joseph Van Name Oct 16, 2023, 4:13 PM
2 points
0
in reply to: Valdes’s comment on: Hyperreals in a Nutshell
I have heard of filters and ultrafilters, but I have never heard of anyone calling any sort of filter a hyperfilter. Perhaps it is because the ultrafilters are used to make fields of hyperreal numbers, so we can blame this on the terminology. Similarly, the uniform spaces where the hyperspace is complete are called supercomplete instead of hypercomplete.
But the reason why we need to use a filter instead of a collection of sets is that we need to obtain an equivalence relation.
Suppose that $I$ is an index set and $X_{i}$ is a set with $| X_{i} | > 2$ for $i \in I$ . Then let $M$ be a collection of subsets of $I$ . Define a relation $R$ on $\prod_{i \in I} X_{i}$ by setting $((x_{i})_{i \in I}, (y_{i})_{i \in I}) \in R$ if and only if ${i \in I : x_{i} = y_{i}} \in M$ . Then in order for $R$ to be an equivalence relation, $R$ must be reflexive, symmetric, and transitive. Observe that $R$ is always symmetric, and $R$ is reflexive precisely when $I \in M$ .
Proposition: The relation $R$ is transitive if and only if $M$ is a filter.
Proof:
$\leftarrow$ Suppose that $M$ is a filter. Then whenever $((x_{i})_{i \in I}, (y_{i})_{i \in I}), ((y_{i})_{i \in I}, (z_{i})_{i \in I}) \in R$ , we have
${i \in I : x_{i} = y_{i}}, {i \in I : y_{i} = z_{i}} \in M$ , so since
${i \in I : x_{i} = z_{i}} \supseteq {i \in I : x_{i} = y_{i}} \cap {i \in I : y_{i} = z_{i}}$ , we conclude that ${i \in I : x_{i} = z_{i}}$ as well. Therefore, $((x_{i})_{i \in I}, (z_{i})_{i \in I}) \in R$ .
$\to$ . Suppose now that $R, S \in M$ . Then let let $y = 0, x = χ_{R^{c}}, z = 2 \cdot χ_{S^{c}}$ where $χ$ denotes the characteristic function. Then $[x = y] = R, [y = z] = S$ and $[x = z] = R \cap S$ . Therefore, $(x, y), (y, z) \in R$ , so by transitivity, $(x, z) \in R$ as well, hence $R \cap S = [x = z] \in M$ .
Suppose now that $R \subseteq S$ and $R \in M$ . Let $x = 0, y = χ_{R^{c}}$ and set $z = 2 \cdot χ_{S^{c}}$ .
Observe that $[x = y] = R$ and $[y = z] = R$ . Therefore, $(x, y), (y, z) \in R$ . Thus, by transitivity, we know that $(x, z) \in R$ . Therefore, $S = [x = z] \in M$ . We conclude that $M$ is closed under taking supersets. Therefore, $M$ is a filter.
Q.E.D.

Joseph Van Name Oct 16, 2023, 12:16 AM
3 points
0
in reply to: Viliam’s comment on: Hyperreals in a Nutshell
Yes. We have 2=[(2,2,2,...)]. But we can compare 2 with (1,3,1,3,1,3,...) since (1,3,1,3,1,3,1,3,...)=1 (this happens when the set of all even natural numbers is in your ultrafilter) or (1,3,1,3,1,3,1,3,...)=3 (this happens when the set of all odd natural numbers is in your ultrafilter). Your partially ordered set is actually a linear ordering because whenever we have two sequences $(a_{n})_{n}, (b_{n})_{n}$ , one of the sets
${n : a_{n} > b_{n}}, {n : a_{n} < b_{n}}, {n : a_{n} = b_{n}}$ is in your ultrafilter (you can think of an ultrafilter as a thing that selects one block out of every partition of the natural numbers into finitely many pieces), and if your ultrafilter contains
${n : a_{n} > b_{n}}$ , then $[(a_{n})_{n}] > [(b_{n})_{n}]$ .

Joseph Van Name Oct 13, 2023, 7:45 PM
1 point
0
on: Deep learning models might be secretly (almost) linear
I trained a (plain) neural network on a couple of occasions to predict the output of the function $x_{1} \oplus \dots \oplus x_{5}$ where $x_{1}, \dots, x_{5}$ are bits and $\oplus$ denotes the XOR operation. The neural network was hopelessly confused despite the fact that neural networks usually do not have any trouble memorizing large quantities of random information. This time the neural network could not even memorize the truth table for XOR. While the operation $(x_{1}, \dots, x_{5}) \mapsto x_{1} \oplus \dots \oplus x_{5}$ is linear over the field $F_{2}$ , it is quite non-linear over $R$ . The inability for a simple neural network to learn this function indicates that neural networks are better at learning when they are not required to stray too far away from linearity.