I find it rather weird that mathematics is usually taught to people on a historical basis. It’s likely possible for a 4th century Alexandrian mathematician to teach it in a modern primary school and get decent results. An 18th century Venetian tutor could carry the curriculum all the way up to high school. I’ll grant they might need a week or two to brush up on sets and some modern notation.
On the other hand, Darwin or Huxley would most likely get fired within the week if they tried to teach high school biology. The great Aristotle would be unable to introduce a 1st grader to any of the sciences, despite his works in the precursors of modern science remaining foundational more than a millennium after his death.
Mathematical knowledge retains validity over time to a great extent. In the rare cases where it becomes irrelevant or poorly formulated, it nonetheless remains valid.
The obvious hypothesis as to why this is true is that mathematics is rather “pure” compared to many/all other fields.
That’s boring, so I’ve come up with a second, more interesting and much more scandalous hypothesis.
Mathematical abstractions are very leaky
In case the term is unfamiliar to you, a leaky abstraction is an abstraction that leaks the details that it’s supposed to abstract away.
Take something like sigma summation (∑). At first glance it looks like an abstraction, but it’s too leaky to be a good one. Why is it leaky ? Because all the concepts it “abstracts” over need to be understood in order to use the sigma summation. You need to “understand” operations, operators, numbers, ranges, limits and series.
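As a rough sketch of that leak (in Python, where `sigma` is a made-up helper name and not part of any library): a finite sigma sum is little more than a loop, but once the upper bound is infinity the loop stops being enough and you are back to reasoning about limits and series.

```python
def sigma(term, lower, upper):
    """Finite sigma sum: term(lower) + term(lower + 1) + ... + term(upper)."""
    total = 0
    for k in range(lower, upper + 1):
        total += term(k)
    return total


# The finite case really is just a loop.
print(sigma(lambda k: k, 1, 100))  # 5050

# An "infinite" sum cannot be looped over. All the code can do is compute
# partial sums; whether they converge, and to what, is left to the reader.
for n in (5, 10, 20):
    print(n, sigma(lambda k: 1 / 2**k, 1, n))
```

The partial sums creep towards 1, but nothing in the code tells you that they must. That knowledge has to come from the concepts the notation was supposed to hide.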
For another example take integrals (∫), where the argument for leakiness can be succinctly stated. To master the abstraction one needs to understand all of calculus and all that is foundational to calculus (also known as “all of math” until 60 years ago or so).
A leaky abstraction is the rule rather than the exception in mathematics. The only popular counterexample that comes to mind is the integral transform (think Laplace and Fourier transforms). Indeed, any mathematician in the audience might scoff at me for using the word “abstraction” for what they think of as shorthand notations.
Why ? Arguably because mathematics has no such thing as an “observation” to abstract over.
Science has the advantage of making observations: it sees that X correlates with Y in some way, then it tries to find a causal relationship between the two via theory (which is essentially an abstraction).
Darwin’s observations about finches were not by any means “wrong”. Heck, most of what Jabir ibn Hayyan found out about the world is probably still correct. What is incorrect or insufficient are the theoretical frameworks they used to explain them.
Alchemy might have described the chemical properties and interactions of certain types of matter quite well. We’ve replaced it with the Bohr model of chemistry simply because it abstracts away more observations than alchemy.
Thinking with competition, sexual reproduction, DNA, RNA and mutations explains Darwin’s observations. This doesn’t mean his original “survival of the fittest” evolutionary framework was “wrong”. It just makes it a tool that outlived its value.
So mathematics ends up being “leaky” because it has no such thing as an “observation”. The fact that 2 * 3 = 6 is not observed, it’s simply “known”. Or… is it ?
The statement 95233345745213 * 4353614555235239 = 414609280180109235394973160907 is just as true as 2 * 3 = 6.
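For a modern reader the check is a one-liner. A minimal sketch in Python, relying only on the fact that Python integers have arbitrary precision (the variable names are just for illustration):

```python
a = 95233345745213
b = 4353614555235239
claimed = 414609280180109235394973160907

# Python integers are arbitrary precision, so this comparison is exact.
print(a * b == claimed)  # True

# How many fingers a ten-fingered checker would still be missing.
print(claimed - 10)
```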
Tell a well educated ancient Greek: “95233345745213 times 4353614555235239 is always equal to 414609280180109235394973160907, this must be fundamentally true within our system of mathematics”… and he will look at you like you’re a complete nut.
Tell that same ancient Greek: “2 times 3 is always equal to 6, this must be fundamentally true within our system of mathematics”… and he might think you are a bit pompous, but overall he will nod and agree with your rather banal statement.
To a modern person there is little difference in how “obviously true” the two statements seem, because a modern person has a calculator.
But before the advent of more modern techniques for working with numbers, such calculations were beyond the reach of most if not all. There would be no “obvious” way of checking the truth of that first statement… people would be short 414609280180109235394973160897 fingers.
So really, there is something that serves a similar role to observations in mathematics: that which most can agree on as being “obviously true”. Obviously there are cases where this concept breaks down a bit (e.g. Ramanujan summation), but these are the exception rather than the rule.
Computers basically allow us to raise the bar for “obviously true” really high. So high that “this is obviously false” brute force approaches can be used to disprove sophisticated conjectures. As long as we are willing to trust the implementation of the software and hardware, we can quickly validate any mathematical tool over a very large finite domain.
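A well-known instance is Euler's sum of powers conjecture, which implies that four positive fifth powers can never add up to a fifth power. The sketch below is a plain brute-force search in Python. The function name is made up, and fixing the right-hand side at 144 is an assumption made only to keep the search small enough to run in about a second.

```python
from itertools import combinations_with_replacement


def fifth_power_sums(e):
    """Find all a <= b <= c <= d with a^5 + b^5 + c^5 + d^5 == e^5."""
    root_of = {n**5: n for n in range(1, e)}  # fifth power -> its root
    target = e**5
    found = []
    # Try every a <= b <= c below e, then look up whether the remainder
    # happens to be a perfect fifth power d with d >= c.
    for a, b, c in combinations_with_replacement(range(1, e), 3):
        d = root_of.get(target - a**5 - b**5 - c**5)
        if d is not None and d >= c:
            found.append((a, b, c, d))
    return found


solutions = fifth_power_sums(144)
print(solutions)  # includes (27, 84, 110, 133)
```

Nothing about the conjecture has to be understood to run this. The machine simply checks roughly half a million candidates and reports that 27^5 + 84^5 + 110^5 + 133^5 = 144^5.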
Computers also allow us to build functional mathematical abstractions, because suddenly we can “use” those abstractions without understanding them. Being able to use a function and understanding what a function does are different things in the modern age, but this is a very recent development.
For the most part computers have been used to run leaky mathematical abstractions. Leaky by design, made for a world where one had to “build up” their knowledge to use them from the most obvious of truths.
However, I think that non-leaky abstractions are slowly showing up to the party, and I think they have a lot of potential. In my view, the best specimen of such an abstraction is the neural network.
Neural Networks as mathematical abstractions
As far as dimensionality reduction (DR) methods go, an autoencoder (AE) is basically one of the easiest ones for me to explain, implement and understand, despite being quite sophisticated compared to most other nonlinear DR methods.
The way I’d explain it (maybe a bit more rigorous than necessary) is:
1. Build a network that gets the data as an input and has to predict the same data as an output.
2. Somewhere in that network, preferably close to the output layer, add a layer E of size n, where n is the number of dimensions you want to reduce the data to.
3. Train the network.
4. Run your data through the network and extract the output from the layer E as the reduced dimension representation of said data.
5. Since we know from training the network that the values generated by E are good enough to reconstruct the input values, they must be a good representation.
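The five steps above can be sketched in plain Python. This is a deliberately tiny toy under stated assumptions: the data is made up (2-D points on a line), the network is linear with a single unit in layer E (n = 1), and the gradients in `train` are derived by hand. In practice that step is exactly what a library's training call would hide.

```python
import random

# Toy data: 2-D points that all lie on the line y = 2x, so one number
# per point is enough to describe them.
data = [(t / 10, 2 * t / 10) for t in range(-10, 11)]

rng = random.Random(0)
w = [rng.uniform(-0.5, 0.5) for _ in range(2)]  # encoder: 2 inputs -> layer E
v = [rng.uniform(-0.5, 0.5) for _ in range(2)]  # decoder: layer E -> 2 outputs


def encode(x):
    return w[0] * x[0] + w[1] * x[1]


def decode(z):
    return (v[0] * z, v[1] * z)


def reconstruction_loss():
    total = 0.0
    for x in data:
        x_hat = decode(encode(x))
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)


def train(steps=2000, lr=0.05):
    """Step 3: full-batch gradient descent on the reconstruction loss."""
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_v = [0.0, 0.0]
        for x in data:
            z = encode(x)
            x_hat = decode(z)
            err = [2 * (x_hat[i] - x[i]) / len(data) for i in range(2)]
            back = err[0] * v[0] + err[1] * v[1]  # error reaching layer E
            for i in range(2):
                grad_v[i] += err[i] * z
                grad_w[i] += back * x[i]
        for i in range(2):
            w[i] -= lr * grad_w[i]
            v[i] -= lr * grad_v[i]


loss_before = reconstruction_loss()
train()
loss_after = reconstruction_loss()
codes = [encode(x) for x in data]  # step 4: the reduced, 1-D representation
print(loss_before, loss_after)
```

Step 5 is the comparison of the two printed losses: if the reconstruction error after training is close to zero, the single number produced by E carries enough information to rebuild each point.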
This is a very simple explanation compared to that of most “classic” nonlinear DR algorithms. Even more importantly, it uses no mathematics whatsoever. Or rather, it uses a heap of mathematics, but it’s all abstracted away behind the word “train”.
Consider for a moment the specific case of DR. It could be argued that a subset of AEs are basically performing PCA (1) (2).
Most “real” AEs are, however, nonlinear. We could see them as doing the same thing as a kernel PCA, but removing the need for us to choose the kernel.
AEs are also impressive because they “work well”. Most people working on a hard DR problem will probably use some brand of AE (e.g. a VAE).
Even more than that, AEs are fairly “generic” compared to other nonlinear DR algorithms, and there are a lot of these algorithms. So one can strive to understand when and why to apply various DR algorithms, or one can strive to understand AEs. The end result will be similar in efficacy when applied to various problems, but understanding AEs is much quicker than understanding a few dozen other DR algorithms.
Even better, in order to understand AEs the only “hard” part is to understand neural networks. But neural networks are a very broad tool, so many people may already possess some understanding of them.
To understand neural networks you need three other concepts: automatic differentiation as used to compute the gradients, loss computation and optimization.
Automatic differentiation is hard, to the point where I assume most people otherwise skilled in coding and ML wouldn’t be able to replicate Google’s JAX or torch’s autograd. Luckily for us, automatic differentiation is very easy to abstract away and explain: “Based on a loss function computed between the real values and your outputs (the error), we use {magic} to estimate how much every weight and bias was responsible for that error and whether its influence increased or decreased the error. Furthermore, we can use {magic} to do this efficiently in batches”.
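For the curious, a peek behind {magic}: the sketch below is a toy reverse-mode differentiator in Python for scalars, supporting only addition, subtraction and multiplication. The `Value` class and its methods are invented for illustration. It is nowhere near what JAX or torch's autograd actually do (no tensors, no batching, no efficiency), but it shows the bookkeeping idea.

```python
class Value:
    """A number that remembers how it was computed, so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # Values this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    @staticmethod
    def wrap(x):
        return x if isinstance(x, Value) else Value(x)

    def __add__(self, other):
        other = Value.wrap(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = Value.wrap(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __sub__(self, other):
        return self + Value.wrap(other) * -1.0

    def backward(self):
        # Order the graph so each node is handled after everything that uses it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad  # chain rule


w, b = Value(2.0), Value(1.0)   # one weight and one bias
x, y = Value(3.0), Value(10.0)  # one input and its real value
error = w * x + b - y           # output minus real value: 7 - 10 = -3
loss = error * error            # squared error: 9
loss.backward()
print(w.grad, b.grad)           # -18.0 -6.0
```

Both gradients are negative, which is the "responsibility" statement from the explanation above: increasing the weight or the bias would decrease the error.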
This sort of explanation is not that satisfactory, but I doubt most people go through life acquiring a much deeper understanding than this. I have no strong evidence that this is true, but empirically I notice new papers about new optimizers and loss functions pop up every day on /r/MachineLearning. I can’t remember the last time I saw a paper proposing significant improvements to common autograd methods or scrapping them altogether for something new.
Optimizers are… not that hard. Explaining something like Adam or stochastic gradient descent to someone is fairly intuitive. Since the optimization itself is applied to rather trivial 2d functions and the process is just repeated a lot of times, it’s not that hard to conceptualize how it works. In any case, you can probably construct a “just so” story for why optimizers work similar to the one above, tell people to use AdamW and the world will be just fine.
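To make the "trivial function, repeated a lot of times" picture concrete, here is a minimal sketch of plain gradient descent in Python. It is not Adam or AdamW, which add running averages of the gradient on top of this. The function names and the example function are made up for illustration.

```python
def gradient_descent(grad, start, lr=0.1, steps=100):
    """Repeatedly take a small step against the gradient."""
    point = list(start)
    for _ in range(steps):
        g = grad(point)
        point = [p - lr * gi for p, gi in zip(point, g)]
    return point


# f(x, y) = (x - 3)^2 + (y + 1)^2 has its minimum at (3, -1).
def grad_f(p):
    x, y = p
    return [2 * (x - 3), 2 * (y + 1)]


minimum = gradient_descent(grad_f, start=[0.0, 0.0])
print(minimum)  # close to [3.0, -1.0]
```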
Loss functions are certainly a trivial concept and if you stick to the heuristic behind basic loss functions and don’t try to factor them into crazy inferences, I’d argue anyone could probably write their own loss functions the same way they write their own training loops.
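As a sketch of how little is involved, here are two basic loss functions written by hand in Python (mean squared error and mean absolute error; the function names and sample values are just for illustration):

```python
def mean_squared_error(targets, outputs):
    """Average of squared differences; large misses dominate."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


def mean_absolute_error(targets, outputs):
    """Average of absolute differences; every unit of error counts the same."""
    return sum(abs(t - o) for t, o in zip(targets, outputs)) / len(targets)


targets = [1.0, 2.0, 3.0]
outputs = [1.0, 2.0, 5.0]
print(mean_squared_error(targets, outputs))   # 4/3
print(mean_absolute_error(targets, outputs))  # 2/3
```

The heuristic is the whole design: pick the function whose idea of "a bad miss" matches yours.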
It’s really hard to argue how much harder it is to understand these 3 concepts to the degree where you can understand how to design neural networks, versus how hard it is to understand the kernel trick. But I would certainly say they are probably in the same difficulty and time ballpark.
However, the advantages of neural networks are:
a) You don’t actually have to deeply understand the concept of automatic differentiation and optimization in order to design a network that does roughly what you want.
b) Once you learn the 3 basic concepts, you have the foundation on which to understand anything neural network related. Be it AEs, RNNs, CNNs, Residual blocks/nets or transformers.
So essentially, neural networks become a sort of mathematical abstraction that isn’t very leaky. Sure, there is some need for understanding of the underlying concepts in order to figure out how to use it, but it’s rather minimal. You can train a programmer with basic algebra knowledge to use Keras in a few days. The same cannot be said about using 4-dimensional geometry or complex calculus.
And the neural network abstraction gives us a higher-level language to talk about ML algorithms. As was the case with the AE, some of the resulting algorithms are arguably not that far off from the “classical” ML world, but instead of intellectual monstrosities, the neural network based designs are conceptually simple: all of the complexity is in finding the weights and biases, hidden behind the abstraction.
For example, I’ve heard it argued that a typical transformer network (think BERT), is basically similar in structure to a “classical” NLP processing pipeline (3).
About 3 years ago, it was really popular to argue that CNN layers could be thought of as feature detectors that followed a simple to complex hierarchy the closer you got to the output layer (e.g. the first layer handles edges, the second layer handles simple geometric shapes… the 12th layer handles facial features specific to mullet fishes and seagulls).
The fact that they sometimes resemble these algorithms, however, is more of a cherry on top rather than the crux of the matter. At bottom, what matters is whether models created with this abstraction work, and the last 5 years have answered that question with a resounding “Yes”.
So why is this important ?
Well, it’s important because non-leaky mathematical abstractions are rather rare, especially ones that have such a low barrier to entry and are used so widely. I wouldn’t compare neural networks to integral transforms just yet, but I think we are getting there.
I’d also argue it’s important because it explains why neural networks have not only taken over the field of “hard” ML problems, but are now making their way into all facets of ML where a SVM or DT or GB classifier might have worked just fine. It’s not necessarily because they are “better”, but it’s because people have more confidence in using them as an abstraction.
Lastly, it’s important because it’s a way to conceptualize why neural networks are in a way better than classical ML algorithms: this lack of leaks means that anyone can play around with them without breaking the whole thing and being thrown one level down. Want to change the shape ? Sure. Want to change the activation function ? Sure. Want to add a state function to certain elements ? Go ahead. Want to add random connections between various elements ? Don’t see why not… etc. They have a lot of tweakable hyperparameters, and they are modifiable in practice, not just in principle.
Of course, every ML algorithm has tweakable parameters, but as soon as you start changing the kernel function of your SVM you realize that for the tweaks to be useful, the abstraction must break down and you need to learn the concepts underneath (and so on, and so on).
It’s rare for me to argue that a relatively popular and hyped up thing is “even better” than people think. But in the case of neural networks, I truly think that they are among the first of a “new” type of mathematical abstractions. They allow people that don’t have the dozen+ years background of learning applied mathematics, to do applied mathematics.
I’m not sure I follow. I certainly did a lot of integration before I knew how to formalize the concept, and I think the formal details only rarely leak. Certainly, I got through an entire four-year math degree without learning most of the formalisms listed there.
Perhaps this is not “mastering” integrals, but… if integrals are above the bar for leakiness, I’d be surprised if neural nets are below it (though I’m less comfortable with those than I am with integrals).
I mean, the kinds of integrals one solves in school are rather trivial, essentially edge cases that never come up IRL. (e.g. ∫x^2 or ∫e^x kinda thing)
But even in that case, you still have to correctly define what you want to integrate as a function, you can’t just draw a random geometric shape and integrate it and you have to correctly “use” the integral operator.
Given an arbitrary function it’s not at all obvious what the integral of that function will look like and it sometimes requires a lot of skill to deduce.
Given an arbitrary problem where integrals are needed I find that it’s often non-obvious how to pick the bound, especially when we get into 2d and 3d integrals (or however you call them, I’m referring to ∫∫ and ∫∫∫).
Interesting perspective, thanks for crossposting!
So is this an argument for the end-to-end principle?
My read was that it’s less an argument for the end-to-end principle and more an argument for modular, composable building blocks of which understanding of internals is not required (not the author though).
(Note that my experience of trying new combinations of deep learning components hasn’t really matched this. E.g., I’ve spent a lot of time and effort trying to get new loss functions to work with various deep learning architectures, often with very limited success and often could not get away with not understanding what was going on “under the hood”.)
If it could be construed as me arguing ‘for’ something then yes, this is what I was arguing for. I’m not seeing how the end-to-end principle applies here (as in, the one used in networking), but maybe it’s a different usage of the term I’m unfamiliar with.
It’s just a for loop.
There’s a smaller domain where we can validate the proposed counterexamples. To borrow from your example:
“this is obviously false” 27^5+84^5+110^5+133^5=144^5.
How is this a break down? (I don’t know what you’re building.)
That wording at the end suggests a typo.
It’s not a for loop, for loops don’t deal with infinity as far as I know.
As in, the result 1 + 2 + 3 + 4 … = −1/12 is “obviously false” yet mathematically true. So the pattern of “if something is true in a very intuitive way then it must be mathematically true” doesn’t hold in those kinds of cases (as opposed to the 2 * 3 = 6 case, where mathematics correctly describes what we intuit to be true by saying the statement is correct, at least if you think of mathematics as a “language” in the “programming but running on wetware” sense).
This only seems to be the case because the equals sign is redefined in that sentence.
I wouldn’t say it’s “true”.
Unless you think 1=0.
Proof:
[1] x = 1 + 1 + 1 + ….
Subtract 1 from both sides.
[2] x-1 = 1 + 1 + 1 + …
Substitute using [1].
[3] x-1 = x
Subtract x from both sides.
[4] −1 = 0
Multiply both sides by negative 1.
[5] 1 = 0
I’m pretty sure appending a single number to an infinite series is not the same as appending a number to each of the terms (e.g. combining two infinite series as per my example).
But even if what you wrote were “correct” by the same token that the sum of the divergent series I mentioned is, it doesn’t have much to do with my point in that paragraph, which was to say that these kinds of statements make no intuitive sense and yet have some correctness to them.
They are correct if you accept a strange premise like “infinity = 0” or ignore mistakes, like the one I made in the proof above.