math terminology as convolution


On the one hand, this theory generalizes the Fuchsian and Bers uniformizations of complex hyperbolic curves and their moduli to nonarchimedean places. It is for this reason that we shall often refer to this theory as p-adic Teichmüller theory, for short. On the other hand, the theory under discussion may be regarded as a fairly precise hyperbolic analogue of the Serre-Tate theory of ordinary abelian varieties and their moduli.

— Shinichi Mochizuki

I know some of these words.

— Ed in Good Burger (1997)

terminology as convolution

Math research papers are notorious for using specialized and obscure terminology. Why is that? Why can’t they describe things in terms of simpler components?

Chemists often talk about carbon atoms. They don’t say “an atom with 6 protons, 6 neutrons, and 6 electrons”. Those subatomic particles are grouped together into a single conceptual item. The power of convolutional neural networks shows us that such grouping is not merely a matter of convenience—rather, the selection of which things to group together is a system of thinking.
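
Here’s a loose sketch of what I mean by grouping (toy numbers, arbitrary kernel): in a convolution, each output value summarizes a whole local window of the input as a single number, the way “carbon atom” summarizes a particular bundle of subatomic particles.

```python
import numpy as np

# A 1-D "convolution" as used in CNNs (technically cross-correlation).
# Each output entry groups a 3-wide window of the input into one number,
# according to a fixed pattern (the kernel), i.e. the concept being applied.
signal = np.array([0, 0, 1, 1, 1, 0, 0], dtype=float)
rising_edge = np.array([-1, 0, 1], dtype=float)  # responds to values going up

grouped = np.correlate(signal, rising_edge, mode="valid")
print(grouped)  # [ 1.  1.  0. -1. -1.], one number per window rather than per sample
```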

Neural network research suggests a lot about how humans think. For example, I think the fact that massively multilingual language models work well, with many languages ultimately sharing the same latent space, is a refutation of the Sapir-Whorf Hypothesis. Modern neural networks have also, I think, shown us something about what concepts are. Some linguists have argued that a word like “dog” is a discrete package, a fixed item to which additional information is attached. Based on my comparison of how humans think and how neural networks operate, my view is that the concept “dog” is 3 things (a rough code sketch follows the list):

  1. A region of a latent space for doglike concepts.

  2. One or more prototype dog concepts, which are points in that latent space used to define the region of dog-ness.

  3. A convolution-like transformation by which some data can be packaged into a point in a latent space: “this is a dog” is a way of examining some data from a photo.
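
Here’s a rough code sketch of those three parts (everything in it is a toy stand-in: a random linear map instead of a learned encoder, and an arbitrary radius for the region):

```python
import numpy as np

rng = np.random.default_rng(0)

# (3) The convolution-like transformation: raw data -> point in a latent space.
# A random linear projection stands in for a learned encoder.
encoder = rng.normal(size=(4, 1000))   # a 1000-number "photo" maps to a 4-dim latent point
def encode(photo):
    return encoder @ photo

# (2) A prototype dog concept: one anchor point in that latent space.
dog_prototype = encode(rng.normal(size=1000))

# (1) The region of dog-ness: everything within some radius of the prototype.
def is_dog(photo, radius=20.0):
    return np.linalg.norm(encode(photo) - dog_prototype) < radius

print(is_dog(rng.normal(size=1000)))   # True or False, depending on where the photo lands
```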

Math is often considered universal, but many of the concepts are partly arbitrary. For example, some people have suggested pi as a universal number that alien species would recognize, but other people argue that 2*pi is a more fundamental constant.

For a slightly more “complex” example, consider imaginary numbers. The fundamental theorem of algebra involves them, and that sounds fundamental...but complex numbers can be considered just a special case of replacing numbers with matrices—specifically, with a subset of 2x2 matrices that can be represented by 2 numbers and multiplied with fewer operations. For example, Euler’s formula can be written in matrix form (shown below). There are some advantages to computation with that representation, but arguably it’s just a computational optimization with no conceptual value.
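
Concretely, identify each complex number with a 2x2 matrix:

$$a + bi \;\longleftrightarrow\; \begin{pmatrix} a & -b \\ b & a \end{pmatrix}$$

Euler’s formula $e^{i\theta} = \cos\theta + i\sin\theta$ then reads

$$\exp\begin{pmatrix} 0 & -\theta \\ \theta & 0 \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

which is just the statement that multiplying by $e^{i\theta}$ rotates the plane by $\theta$.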

If we ask whether the concepts used in current mathematics are “good” or “bad”, the usual presumption is that some are good and some are bad, but those are relative terms that depend on what concepts they’re compared to. Some math concepts considered important hundreds of years ago are now considered irrelevant.

other issues

When I say the language of current advanced math is opaque, I’m mainly talking about the concepts, but people say that to mean other things as well:

names

Overloading of common words can be annoying, especially for technical generalists who might go from adding matrices of composite numbers to adding matrices of composite materials. But math isn’t any worse in this regard than various engineering fields.

A lot of mathematical results are named [name]’s theorem or [name]’s lemma. These names are hard to remember because they don’t provide any information about the topic. (Personally, I don’t usually want to have to remember names of mathematicians unless they’re on the level of Euclid, Gauss, or Hilbert.) But math isn’t any worse in this regard than biology or medicine.

equations

Math equations can be hard to read. I think programming languages are often clearer. Yes, I’ve seen mathematicians comparing compact expressions using symbols for summation and integrals to awkward-looking equivalents in pseudocode, but they’re missing the point. The main reason complex math equations are hard to read is that they use, e.g., 12 single-letter variables, 7 of which were defined over the previous 3 pages, and 5 of which are defined below the equation. Nobody sane writes code like that unless they’re entering an obfuscated programming competition. Descriptive variable names and multi-step definitions are better for complex formulae.
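
Here’s a toy illustration of the difference, using the normal distribution’s density as a stand-in for a “complex equation”:

```python
import math

# Math-paper style, transliterated: p = 1/(s*sqrt(2*pi)) * exp(-(x - m)**2 / (2*s**2)),
# where the reader has to go hunt down what m, s, and x stand for.

def normal_density(value, mean, std_dev):
    """The same formula, spelled out with descriptive names and intermediate steps."""
    normalizer = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
    squared_distance_from_mean = (value - mean) ** 2
    exponent = -squared_distance_from_mean / (2.0 * std_dev ** 2)
    return normalizer * math.exp(exponent)

print(normal_density(value=1.0, mean=0.0, std_dev=1.0))  # about 0.242
```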

And then, if you have longer variable names, much of that customary math notation stops working well. It’s also much harder to produce that notation by typing. The notation of math was originally developed for writing simple equations on a chalkboard for people already familiar with related work. It was never meant for typing, extremely complex equations, or distributing work to people in other fields.

network optimization

The 4-color theorem was proven with a computer-assisted proof over 400 pages long. Here are some other particularly long math proofs. Conceptual tools are supposed to make things easy; what long proofs indicate to me isn’t that they have more insights for me to learn, but rather that the tools being used are inadequate for the task—like people are hammering in nails with a rock instead of using a nailgun.

Is the answer building a tower of abstraction even higher? Or...was a wrong turn taken somewhere? Statistically speaking, some of the turns taken were probably suboptimal.

Let’s return to the metaphor of math terminology as convolutions in a neural network. When a large neural network is stuck in some bad local minimum, what can be done? There are multiple options.

The most-effective way to train a neural network is by distillation, imitating another network with better performance. So, perhaps the best option would be to find an alien civilization with more-advanced mathematics and copy the concepts they use.

Sometimes people training neural networks will start over with a new initialization. (So, perhaps mathematics should all be re-developed from scratch, but that seems like a lot of work.) That’s done less than it used to be, because neural networks have gotten larger, and increasing dimensionality adds connections between (what would be) local minima. These days, it’s more likely that there’s a problem with the optimizer than that training is stuck in a bad local minimum.

Let’s consider how gradient descent works, and how that compares to development of math. A network is tried on many tasks, and the effect that various changes would have on performance is averaged out across those tasks. Then, the whole network is updated slightly, and the process repeats. So, people use math to do tasks, and sometimes they notice small changes that would improve things for them. Do people share those possible changes, average them out, and then apply them and see how they work out, perhaps stochastically if they’re discrete changes? No; what happens more often is that mathematicians develop private notations that they use for their own notes. Friction is too high for the culture of mathematics to proceed down long shallow gradients.
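
For reference, here’s the gradient-descent loop I’m describing, as a minimal sketch (a toy linear model with made-up numbers; the point is the structure of averaging across tasks and then nudging everything slightly):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "network": one weight vector. Each example below stands in for a task
# the network is tried on.
true_weights = np.array([1.0, -2.0, 0.5])
inputs = rng.normal(size=(64, 3))
targets = inputs @ true_weights + 0.1 * rng.normal(size=64)

weights = np.zeros(3)
learning_rate = 0.1
for step in range(200):
    errors = inputs @ weights - targets
    # Average, across all tasks, of how a small change in each weight
    # would affect performance.
    average_gradient = inputs.T @ errors / len(errors)
    # Update the whole network slightly, then repeat.
    weights -= learning_rate * average_gradient

print(weights)  # ends up close to [1.0, -2.0, 0.5]
```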

If math involves metaphorical convolutions, and good concepts are good because they produce points in a well-structured latent space, that means that math doesn’t advance through new proofs and theorems per se. Rather, math advances from new concepts and transformations, and proofs are just the means by which they’re tested. This then implies that a shorter and more elegant proof of something already proven is just as important as a new proof, perhaps even more so. But incentives in mathematics aren’t structured around that being the case, perhaps because elegance of proofs is harder for institutions to measure.

As for why I’m writing this now, it’s because I’ve been thinking about questions like “why Transformers work better than other neural network architectures”. The tools developed by mathematicians so far seem inadequate for that, besides trivialities like how distances behave in high-dimensional Euclidean spaces.