A Butterfly’s View of Probability
Cross-posted from my blog.
Thanks to Alex Cai and Haneul Shin for discussing these ideas with me. Thanks to Eric Neyman for feedback.
What do we really mean when we use the word “probability”? Assuming that the universe obeys deterministic physical laws, an event must occur with either probability 0 or 1. The future positions of every atom in the universe are completely determined by their initial conditions. So what do we mean when we make statements like “Trump has a 25% chance of winning in 2024”? It will either happen, or it won’t. From an objective, omniscient standpoint, there’s no room for doubt.
One response is to reject determinism. Maybe in Newton’s day we believed the universe was deterministic, but now we know about wave functions and Heisenberg uncertainty and all of that stuff. If we accept that there is true randomness occurring on the quantum level, then the outcome of the next election isn’t predetermined — it will depend on all of the quantum interactions that occur between now and 2024. With this view, it makes complete sense to assign fractional probabilities.
But quantum randomness is a highly non-obvious property of physics… Is there a way to make sense of probability without relying on it? In this post, I hope to outline a new way of defining the probability of a future event in a deterministic system. In contrast with the Bayesian view — in which uncertainty about an event comes from incomplete information — and the frequentist view — which relies on an experiment being repeatable — this “Butterfly’s View of probability” draws its randomness from chaos theory.
Bayesianism and Frequentism
Let’s go over these two existing views, which are by far the most commonly accepted interpretations of probability.
The Bayesian view of probability is that randomness arises from incomplete information about the world. It is impossible to be aware of the current position of every atom in the universe at once. A Bayesian reasoner embraces this uncertainty by considering probability distributions over all possible universes consistent with his observations. At the core of this philosophy is Bayesian updating: starting with a prior probability distribution, upon seeing new evidence about the world a Bayesian reasoner will update this distribution according to Bayes’ Rule. The probability of an event is the proportion of universes in this distribution in which the event occurs.
Bayesianism has a lot going for it. It works great in theory, serving as the fundamental mathematical law by which a rational agent’s knowledge about the world must interact with itself. It also works great in practice, making accurate predictions as the cornerstone of modern statistics.
But there is one thing that Bayesianism lacks: objectivity. This is not to say that it is unmathematical, but rather that Bayesian probability is inherently subjective. The probability of an event is only defined relative to an agent. Alice and Bob, while both being perfect Bayesian reasoners, can assign different probabilities to Trump winning simply because they have different observations or priors. Because of this, Bayesian probability is often thought of as a “degree of personal belief” rather than an objective probability. In the real world, this is more of a feature than a bug — nobody has perfect information, and if we did then we wouldn’t care about probability in the first place. But in this post, our goal is to find an interpretation of probability that still makes sense from an objective, omniscient standpoint. The omniscient Bayesian would have a posterior probability distribution that places 100% credence on a single possible universe, eliminating all fractional probabilities — so Bayesianism falls short of this goal.
The alternative to Bayesianism is called frequentism. A frequentist defines the probability of an event to be the limit of its relative frequency as you repeat it more and more times. A coin has a 50% chance of landing heads because if you flip it 100 times, close to 50 of the flips will be heads. In contrast with Bayesianism, the frequentist view is perfectly objective: the limit of a ratio will be the same no matter who observes it.
But the problem with frequentism is that it only makes sense when you’re talking about a well-defined repeatable random experiment like a coin flip. How would a frequentist define the probability that Trump wins the election? It’s not like we can just run the election 100 times in a row and take the average — by definition, the 2024 election is a non-repeatable historical event. We could consider simulating the same election over and over, but what initial conditions do we use for each trial? Frequentism doesn’t give us a recipe for how to define these simulations. This post will be my attempt to generalize the frequentist view by providing this recipe.
Bayesianism is rooted in uncertainty, so it is inherently subjective. Frequentism only applies to black-boxed repeatable experiments, so it struggles at describing events in the physical universe. Now I present a third view of probability that solves these two problems. I call this the Butterfly’s View.
The Butterfly Effect
On a perfect pool table, it is only possible to predict nine collisions before you have to take into account the gravitational force of a person standing in the room.[1] Even an imperceptible change in the initial conditions becomes noticeable after just a few seconds. This is known as the “Butterfly Effect” — the idea that if you make a tiny change to a complex deterministic system, that change will propagate and compound at an exponential rate. This makes it extremely hard (though not impossible in theory) to predict the state of a chaotic physical system, even over short time periods.
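To get a feel for how quickly this compounding works, here is a minimal Python sketch. It assumes (following the rough rule in footnote [1]) that each collision multiplies the angular error by a fixed factor, and it counts collisions until the error reaches about one radian. The function name, the factor of 10, and the initial errors are all made up for illustration; they are not taken from Berry's calculation.

```python
def collisions_until_unpredictable(initial_error_rad, growth_factor, threshold_rad=1.0):
    """Count collisions until the angular error first reaches the threshold.

    Assumption of this sketch: every collision multiplies the error by the
    same growth_factor (standing in for distance-between-collisions / ball radius).
    """
    if growth_factor <= 1:
        raise ValueError("errors only compound when growth_factor > 1")
    error = initial_error_rad
    collisions = 0
    while error < threshold_rad:
        error *= growth_factor
        collisions += 1
    return collisions


# Hypothetical initial errors, chosen only to show the trend.
for initial_error in (1e-6, 1e-12, 1e-24, 1e-48):
    n = collisions_until_unpredictable(initial_error, growth_factor=10.0)
    print(f"initial error {initial_error:.0e} rad -> {n} collisions")
```

The point is that the count grows only logarithmically in the initial precision: squaring the precision merely doubles (roughly) the number of collisions you can predict.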
I believe that almost every aspect of our physical universe has the same chaotic properties as the pool table. The Brownian motion of air molecules, the complex firing patterns of neurons in our brain, and the turbulent flow of ocean currents are all extremely sensitive to changes. One tiny nudge could completely change the course of history.
How might this happen? Consider the consequences of adding a single electron at the edge of the observable universe. The gravitational pull of this electron is enough to disrupt the trajectories of all air molecules on Earth after only 50 collisions… a fraction of a microsecond. This changes the atmospheric noise that random.org uses to seed its random number generators, which changes the order in which my Spotify playlist gets shuffled,[2] which subtly affects my current mental state, which causes me to write this sentence with a different word order, and so on. In a matter of minutes, human events are unfolding in a measurably different fashion than they would have had that electron never existed.
The Formalization
We can use the random-seeming chaos generated by the Butterfly Effect to define a new notion of probability in a deterministic system. Informally, the “Butterfly Probability” of an event is the percentage of small perturbations to the current universe that result in that event occurring. To be more precise, I’ve come up with the following formalization.
Let $\mathcal{U}$ be the space of all possible universes. Every $u \in \mathcal{U}$ represents some particular arrangement of elementary particles (along with their velocities, spins, and so on) at a particular point in time. Think of this as a “snapshot” of a universe. One of the elements of $\mathcal{U}$ is a snapshot of our current universe; let’s call it $u_0$.
Since we’re assuming a deterministic version of physics, we have some transition function $f: \mathcal{U} \times \mathbb{R}^+ \to \mathcal{U}$. This function takes in a universe snapshot $u$ and some time $t$, then outputs the configuration that this universe will be in after $t$ seconds.[3]
Now we define a distance metric $d$ on $\mathcal{U}$. This function takes in two universes $u_1, u_2 \in \mathcal{U}$ and tells us how physically different they are. Think of this as an “edit distance” on the physical structure of the universe. We can use a definition along the lines of the following:
In one operation, you can pay $x$ dollars to translate one atom/particle by $x$ meters or to change the velocity of an atom/particle by $x$ m/s. You can also pay $m$ dollars to add or delete an atom/particle of mass $m$. Then $d(u_1, u_2)$ is defined to be the minimum cost of any series of operations that transforms $u_1$ into $u_2$.
The exact details of $d$ are unimportant. All that matters is that $d$ is a valid distance function on $\mathcal{U}$ that gets very close to $0$ when comparing two universes that are basically identical.
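To make the edit-distance idea concrete, here is a rough Python sketch. It is not the definition itself: it assumes snapshots are dictionaries keyed by a hypothetical particle id, matches particles by that id, and prices a translation by its Euclidean length. Because it only considers one simple edit script, the number it returns is an upper bound on the true minimum cost. The `Particle` class and `edit_cost` function are names invented for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Particle:
    mass: float       # kilograms
    position: tuple   # (x, y, z) in meters
    velocity: tuple   # (vx, vy, vz) in meters per second


def edit_cost(u1, u2):
    """Cost of one simple edit script turning snapshot u1 into snapshot u2.

    Snapshots are dicts mapping a particle id to a Particle. Matching by id
    is an assumption of this sketch, so the result is only an upper bound
    on the minimum-cost distance described in the text.
    """
    total = 0.0
    for pid in u1.keys() | u2.keys():
        p, q = u1.get(pid), u2.get(pid)
        if p is None:
            total += q.mass                  # pay to add q
        elif q is None:
            total += p.mass                  # pay to delete p
        else:
            swap = p.mass + q.mass           # delete p, then add q
            if p.mass != q.mass:
                total += swap                # mass can only change via delete + add
            else:
                move = (math.dist(p.position, q.position)
                        + math.dist(p.velocity, q.velocity))
                total += min(move, swap)     # take the cheaper of the two routes
    return total


# Hypothetical two-particle "universe" and a slightly nudged copy.
u0 = {
    "a": Particle(1e-3, (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)),
    "b": Particle(2e-3, (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),
}
nudged = dict(u0, a=Particle(1e-3, (1e-9, 0.0, 0.0), (0.0, 0.0, 0.0)))

print(edit_cost(u0, u0))      # identical snapshots
print(edit_cost(u0, nudged))  # one particle moved by a nanometer
```

One side effect of the pricing rule shows up in the `min`: moving a very light particle a long way is never more expensive than deleting it and re-adding it somewhere else.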
Now we’re ready to define probability! Say we have some predicate $X$ such as “Trump wins in 2024”. $X: \mathcal{U} \to \{0, 1\}$ takes in a universe snapshot and tells us whether or not a given event has occurred. If $X$ doesn’t make sense on a given universe (for example, if you try to plug in a universe that has no Earth, or nobody named “Trump”), then it outputs $0$. Then we define the function $P_\epsilon$ (parameterized by $\epsilon$) with type signature $(\mathcal{U} \to \{0, 1\}) \times \mathbb{R}^+ \to [0, 1]$:
$$P_\epsilon(X, t) = \Pr_{u \,:\, d(u, u_0) < \epsilon}\left[X(f(u, t)) = 1\right]$$
Notice the subscript: $u : d(u, u_0) < \epsilon$. This means that we are drawing a universe $u$ uniformly at random[4] from all universes that satisfy $d(u, u_0) < \epsilon$. In other words, it is sampling from the $\epsilon$-ball of universes around $u_0$. The final value of $P_\epsilon(X, t)$ is the proportion of universes in this $\epsilon$-ball that end up with $X$ being realized within the given timeframe $t$. Essentially, $P_\epsilon$ is asking something like: “Given a uniformly random small perturbation to our universe, what’s the probability that it results in Trump getting elected in 2024?”[5]
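Nobody can run this on the real universe, but the definition can be tried out on a toy. The sketch below is an illustration under heavy assumptions: the "universe" is a single number in [0, 1], the transition function is the chaotic logistic map, the distance is plain absolute difference, and the predicate asks whether the final state lands in the upper half. None of these choices come from the formalization above; they are stand-ins, and `estimate_p_eps` is a made-up name.

```python
import random


def f(u, t):
    """Toy transition function: iterate the logistic map x -> 4x(1 - x) t times."""
    for _ in range(t):
        u = 4.0 * u * (1.0 - u)
    return u


def X(u):
    """Toy predicate: 1 if the state lies in the upper half of [0, 1], else 0."""
    return 1 if u > 0.5 else 0


def estimate_p_eps(u0, t, eps, samples=20_000, seed=0):
    """Monte Carlo estimate of P_eps(X, t), using d(u, u0) = |u - u0|."""
    if not (0.0 <= u0 - eps and u0 + eps <= 1.0):
        raise ValueError("the eps-ball must stay inside [0, 1]")
    rng = random.Random(seed)
    hits = sum(X(f(u0 + rng.uniform(-eps, eps), t)) for _ in range(samples))
    return hits / samples


u0 = 0.3
print("unperturbed outcome:", X(f(u0, 50)))
print("estimate of P_eps  :", estimate_p_eps(u0, t=50, eps=1e-6))
```

In this toy, a perturbation of size 1e-6 is far too small to notice, yet after 50 steps the perturbed copies should land in the upper half about half the time, since the long-run distribution of this particular map is symmetric about 0.5. The unperturbed universe, of course, gives a flat 0 or 1.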
But we’re not done yet. Our function $P_\epsilon$ is still parameterized by $\epsilon$. Choosing different values of $\epsilon$ will result in different answers, so which one do we mean by the probability of $X$?
Consider the following graph which shows how $P_\epsilon(X, t)$ changes as $\epsilon$ approaches $0$.
Let’s focus on the left part of this graph first. As we decrease $\epsilon$, the value of $P_\epsilon(X, t)$ collapses to $0$ or $1$ for values of $\epsilon$ that are extremely small, small enough that even the Butterfly Effect doesn’t have enough time to produce much variation in the resultant universes. In other words, this happens when $\epsilon$ is so small that we get the chain of implications
$$d(u, u_0) < \epsilon \implies f(u, t) \approx f(u_0, t) \implies X(f(u, t)) = X(f(u_0, t)).$$
Assuming that the Butterfly Effect acts exponentially, this would require $\epsilon$ to be double-exponentially small.
But now look at the behavior of $P_\epsilon(X, t)$ when $\epsilon$ isn’t super close to $0$. As $\epsilon$ grows, $P_\epsilon(X, t)$ converges to a certain value, marked by the dotted line. This is what we want to express by the true, unparameterized, “Butterfly’s Probability” of $X$: the value that $P_\epsilon(X, t)$ hovers around when $\epsilon$ is small, but not extremely small.
Why do we need $\epsilon$ to be small? Well, let’s see what happens when we zoom out:
For a while $P_\epsilon(X, t)$ stays along the dotted line, but as $\epsilon$ grows it starts to stray away. I added a few bumps in the graph because I’m not sure what the exact shape here would look like, and it probably depends on what $X$ is. Finally, when $\epsilon$ is very big, $P_\epsilon(X, t)$ tends towards $0$ because the $\epsilon$-ball will mostly contain universes vastly different from $u_0$; most of them won’t even have a human being named “Trump”.
So how do we formally define the Butterfly Probability of $X$? We can’t just write $\lim_{\epsilon \to 0} P_\epsilon(X, t)$ because, as the first graph shows, this would bring us back to our original problem of only having probabilities in $\{0, 1\}$. But we also only want to work with values of $\epsilon$ that are relatively small, lest we run into the end behavior of the second graph. So as a compromise, we define it as follows:
Def. The Butterfly’s Probability of $X$ occurring within time $t$ is the value that $P_\epsilon(X, t)$ converges to as $\epsilon \to 0$, before it collapses to 0 or 1.
I admit that the final caveat in the definition is imprecise, but I am at a loss for how to mathematically formulate this notion of “double convergence”. However, I conjecture that in almost every example of a real-world probability, it will be abundantly clear what this almost-asymptote value should be by simply looking at the graph of $P_\epsilon(X, t)$. My intuition for this is explained in the next section.
The Intuition
Essentially, I claim that the graph of $P_\epsilon(X, t)$ passes through three qualitatively different regions. I’ll redraw it here, with the horizontal scale modified to show all three regions at once.
The behavior in each of these three regions is dominated by a different phenomenon. In the red region, the perturbations are too small for the Butterfly Effect to produce a noticeable difference in the universe over the given timescale. In the green region, the $\epsilon$-ball is too big for the changes to be meaningful: the sampled universes will simply be too different. Finally, the blue region describes the sweet spot in which the behavior of universes in the $\epsilon$-ball is dominated by the Butterfly Effect.
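The red and blue regions can at least be glimpsed in a toy system. The sketch below is only an illustration: it assumes a one-number "universe" evolving under the logistic map, uses absolute difference as the distance, and takes "the state ends up above 0.5" as the predicate. It then sweeps the perturbation radius over many orders of magnitude. The names `p_eps` and `sweep` are invented here, and this bounded one-dimensional toy has no analogue of the green region, since there are no "vastly different universes" to wander into.

```python
import random


def f(u, t):
    """Toy deterministic dynamics (an assumption): t steps of the logistic map."""
    for _ in range(t):
        u = 4.0 * u * (1.0 - u)
    return u


def p_eps(u0, t, eps, samples=4_000, seed=0):
    """Fraction of sampled u with |u - u0| <= eps whose state at time t exceeds 0.5."""
    rng = random.Random(seed)
    hits = sum(f(u0 + rng.uniform(-eps, eps), t) > 0.5 for _ in range(samples))
    return hits / samples


def sweep(u0=0.3, t=20, exponents=range(1, 17)):
    """Return (eps, estimate) pairs for eps = 10^-1, 10^-2, and so on."""
    return [(10.0 ** -k, p_eps(u0, t, 10.0 ** -k)) for k in exponents]


for eps, p in sweep():
    print(f"eps = {eps:.0e}   P_eps ~ {p:.3f}")
```

For the smallest radii the estimates should sit at exactly 0 or 1 (the perturbations have not had time to matter), while for larger radii they should settle near a stable intermediate value. Where the transition happens depends on the time horizon `t`: a longer horizon lets smaller perturbations grow, pushing the collapse further toward zero.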
Our definition of Butterfly Probability relies on the existence of a clear “phase shift” between each of the regions. If the cutoffs between the regions were less stark, then it might be ambiguous which value in the blue region we should count as the true probability. So why do I think that there is an obvious blue region?
My intuition for this is that the Butterfly Effect is so chaotic and sensitive that, once $\epsilon$ is large enough (but still small enough to have the changes be imperceptible to a human), there can be no large-scale structure in the locations of positive and negative perturbations. Imagine coloring each universe $u$ in $\mathcal{U}$ either black or white, depending on the value of $X(f(u, t))$. Then the argument is that in the neighborhood of universes around $u_0$, the color of each point is basically random: there is no pattern among the clustering of black or white universes. $P_\epsilon(X, t)$ will be equal to the proportion of universes near $u_0$ that have the color corresponding to $X$ occurring. It won’t really go up or down as you grow $\epsilon$ because that would require something like “rings” of darker or lighter regions centered around $u_0$, which constitutes large-scale structure.
The best way to think about this is by imagining the density of a gas. Fix some point $p$ in the air around you. Consider the average density of matter in the sphere centered at $p$ with radius $r$ (in other words, the total amount of mass contained in the sphere, divided by the volume of the sphere). Start with $r$ on the scale of meters, then consider what happens as $r$ shrinks to $0$. In the beginning, the density will vary a lot with $r$: the size of the radius will determine what ratio of the sphere is filled with solid matter. This corresponds to the green region above. But then once $r$ is small enough (perhaps less than a centimeter), the density stops changing with $r$ because the immediate vicinity of air around $p$ is homogeneous and has some local density. This is the blue region. But once you shrink $r$ to be super small, the density collapses to either $0$ or an enormous value (nuclear matter has a density on the order of $10^{17}$ kg/m^3), depending on whether $p$ is located in the empty space between atoms or within the nucleus of an atom. This corresponds to the red region.
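The blue-to-red transition in this analogy is easy to reproduce in a stripped-down, one-dimensional version. The sketch below assumes a made-up "gas": identical atoms of a fixed width, evenly spaced along a line, in arbitrary units. It computes the average density on an interval of radius r around a point, once for a point in a gap and once for a point inside an atom. All names and numbers are hypothetical, and there are no solid objects in this toy, so the green region does not appear.

```python
import math


def cumulative_mass(x, spacing, width, mass):
    """Mass lying to the left of x, up to an additive constant.

    Atoms have the given width and are centered at k * spacing for every integer k.
    """
    y = x + width / 2.0                 # shift so atom k occupies [k*spacing, k*spacing + width]
    k = math.floor(y / spacing)         # index of the last atom starting at or before y
    inside = min(y - k * spacing, width) / width   # fraction of atom k left of x
    return mass * (k + inside)


def average_density(p, r, spacing=1.0, width=0.01, mass=1.0):
    """Average (linear) density on the interval [p - r, p + r]."""
    left = cumulative_mass(p - r, spacing, width, mass)
    right = cumulative_mass(p + r, spacing, width, mass)
    return (right - left) / (2.0 * r)


for r in (1e3, 1e1, 1e-1, 1e-3):
    print(f"r = {r:g}: in a gap {average_density(0.5, r):.4g}, "
          f"inside an atom {average_density(0.0, r):.4g}")
```

At large radii both points report roughly the same local density (mass per spacing). As the radius shrinks below the spacing, the two diverge: the point in the gap sees nothing at all, and the point inside an atom sees the much higher density of the atom itself.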
My intuition is that the distribution of black and white universes in $\mathcal{U}$ is qualitatively similar to the distribution of mass in a gas: locally homogeneous at a small-but-not-atomically-small scale. This property of $\mathcal{U}$ (having a continuous “local density”) is really what we’re getting at with Butterfly’s Probability. Local density measures a meaningful property of the universe that is much more robust than the particular color it happens to be.
Conclusion
To say that properly calculating the Butterfly’s Probability of an event is computationally intractable would be the understatement of the century. Calculating even a single probability would require knowing the exact positions of all matter in the universe and the ability to simulate it with near-perfect accuracy. In fact, if the computer which you are using to simulate the universe is itself part of the universe, this leads to paradoxes. Because of this, the value of the Butterfly View formalism done in this blog post is mostly theoretical. It gives us a way to understand what probability would mean from the perspective of God (someone who is completely omniscient and computationally unbounded) without actually being able to carry it out in practice.
However, any time a political scientist or meteorologist builds a big model to predict the future, they are in some sense running an approximation algorithm of a Butterfly’s Probability. In doing so, they make the implicit assumption that the blue region in the graph is large enough that lots of irrelevant information can be left out of the model without much effect on the local density of $\mathcal{U}$. For example, a meteorologist may exclude the presence of the Andromeda galaxy from her simulation. Even though a universe without the Andromeda galaxy is quite different from ours, one can hope that it doesn’t make a big difference on the probability of a predicate like “It rains tomorrow”.
I will conclude with what I believe to be the strengths and weaknesses of the Butterfly View as a theoretical framework for understanding probability.
Strengths:
It combines features of the Bayesian and frequentist views: we can talk about the probabilities of one-off events like the 2024 election without the need for an epistemic reference frame or a prior distribution.
It can be applied to any deterministic system without the need for built-in randomness, as long as the system is chaotic enough to exhibit the Butterfly Effect.
It accurately captures what people mean with the colloquial use of the word “random.” When someone says “the stock market is hard to predict… it’s so random,” they probably don’t mean that market volatility is caused by quantum randomness. Instead, it seems to me that they’re trying to describe how there are too many sensitive moving parts for its behavior to be predicted with confidence.
It has a cool name.
Weaknesses:
As mentioned before, it is computationally intractable.
It cannot be adapted to deal with logical uncertainty (e.g. “What’s the probability that the millionth digit of $\pi$ is a given digit?”). All of the “randomness” in the Butterfly View stems from physical uncertainty. But the decimal expansion of $\pi$ will always be the same no matter how atoms are perturbed, so the Butterfly’s Probability of a mathematical statement is always either $0$ or $1$.
It is time-dependent. Over short timeframes (“What’s the probability that this coin flips heads in the next second?”), the Butterfly Effect might not have enough time to make much of a difference, so the probability will be either $0$ or $1$. Also, it is impossible to talk about probabilities of past events (unless you plug in a snapshot of the universe from before the event occurred).
As you can see, the Butterfly’s View of probability has one more strength than it has weaknesses, making it a good theory!
Thank you for reading all of these words :).
[1] This observation is credited to the physicist Michael Berry (1978) and the calculations are explained in this paper. The idea is that, given some tiny error $\Delta\theta$ in the angle of a trajectory, the next collision will have an angle error of about $(\ell/r)\,\Delta\theta$, then the next will have an error of about $(\ell/r)^2\,\Delta\theta$, and so on (where $\ell$ is the distance traveled between collisions and $r$ is the radius of each ball). So even though $\Delta\theta$ might be vanishingly small, the error becomes quite large after only a few collisions.
[2] OK fine, Spotify doesn’t use random.org to shuffle its playlists, but I’m just trying to give an illustrative example.
[3] If you prefer an interpretation of physics in which time is discretized (as it is in a cellular automaton), you can instead use a single-step transition function $g: \mathcal{U} \to \mathcal{U}$. Then you can think of $f(u, t)$ as $g^t(u)$, where $g$ is iterated $t$ times.
[4] We technically haven’t defined a preferred probability distribution on $\mathcal{U}$ for which we can invoke the phrase “uniformly at random”. I suppose one way you could do this would be to think of $\mathcal{U}$ as $\mathbb{R}^{6n}$ (three spatial components and three velocity components for each particle), where $n$ is the number of particles in the universe, and weight your probability distribution by $6n$-dimensional volume. Or you could think of $\mathcal{U}$ as being discretized by choosing some super small “precision level” at which to encode positions and velocities. But at this point we’re just getting silly; it really doesn’t matter.
[5] Don’t let it bother you that this definition involves a $\Pr$. We’re not being circular because we’re only constructing this definition for physical-world probabilities; we’re allowed to assume that the mathematical theory of probability rests on solid ground.
The gravitational waves propagate at the speed of light, so if you add an electron at the edge of the observable universe, I think it will take much more time until any effect of doing so reaches Earth.
(Is this correct? I am not a physicist.)
Ah yes, I think that’s correct (although I am also not a physicist). A more accurate description would be “In a matter of minutes after the time its gravitational waves reach Earth, human events are unfolding in a measurably different fashion than they would have had that electron never existed.”
This is great! The issue of timescale is interesting to me in this. I am wondering whether, for different systems at different levels of the ergodic hierarchy, there are certain statements you can make (when considering the relevant timescales).
Also I am wondering how this plays with the issue of observer models. When I say that some event one month from now has 30% probability, are you imagining that I have a chaotic world model that I somehow run forward many times, or that I push a probability distribution forward in some way and then count the volume in model space that contains the event? How would that process actually work in practice (i.e., how does my brain do it)?
This is cool, I had never heard of the Ergodic Hierarchy before!
Related to your second point—Alex Cai showed this psychology paper to me. It found that when humans are predicting the behavior of physical systems (e.g. will this stack of blocks fall over?), in their subconscious they are doing exactly this: running the scene in their brain’s internal physics engine with a bunch of initial perturbations/randomness and selecting the majority result. Of course, predicting how a tower of blocks will topple is a lot different from predicting the probability of an event one month into the future.
I think this is a fun exercise. It of course can’t replace the Bayesian model of probability, but it’s conceptually interesting enough as a way to think about chaos.
Having to pick a metric is comparable to having an epistemological frame. Some metrics might have a different “double convergence” than other metrics. If the metrics do not agree then it’s not really objective.
If the probability of a statement that doesn’t make sense is treated as 0, it would seem to me that “I can’t derive that” should also be assigned 0 on the same basis. So the logical uncertainty is defined, but just clunky and not particularly inspiring. I would have also thought that the analog would be to have different axiom sets be the things metrics are defined over.
Or if one wants to insist that logical probabilities are undefined, this should also be extended so that the Trump probabilities start to become undefined once the $\epsilon$-sphere starts to include sufficiently alien worlds. That could also be a natural boundary: the largest radius for which the statement is still defined.
One interesting metric to use would be to pick an agent in the world and use the difference in qualia / experience for the distance. That would be “worlds that feel almost like this”. But if these sort of “exotic” metrics are “too partial” there is still work to be done in defining what sort of metrics are “well-behaved”.
A deterministic system has good grounds to be time-reversible, and in the case that it is, past events do have butterfly probabilities. There is an analogy with quantum probabilities: there is no fact of the matter about which slit the particle went through in the double-slit experiment. Thus, tracing the motion backward from the screen, even if the particle is classical, only a small shift in the velocity vector would be required for the ball to have come from the other slit (both slits having appreciable butterfly amplitude).
Uh, probability is in the map. Uncertainty is in the map. Bayesianism and Frequentism are not at odds. The prior is invisibly the fraction of possible past worlds one can imagine. The probability of an election outcome is the fraction of possible future worlds one can imagine that can emerge from the possible past worlds one had imagined. All that’s needed is fine-graining and counting. There is no need for nonlinearity and chaos.
This also resolves the so-called logical uncertainty: the probability of the n-th digit of pi being 0 depends on the agent doing the estimate. Some agents have more detailed and accurate maps than others, and their probabilities may converge with each other, such as that the 3^^^^3-th digit of pi is 0 with probability 1⁄10, even though it will likely never be calculated by anyone. My personal probability that the 20th digit of pi is 0 is 1⁄10, up until I look it up and then it snaps to either 0 or 1, or more like 10^(-5) away from those, since my senses and Google can lie to me.
I agree with everything you’re saying. Probability, in the most common sense of “how confident am I that X will occur,” is a property of the map, not the territory.
The next natural question is “does it even make sense for us to define a notion of ‘probability’ as a property of the territory, independent from anyone’s map?” You could argue no, that’s not what probability means; probability is inherently about maps. But the goal of the post is to offer a way to extend the notion of probability to be a property of the territory instead of the map. I think chaos theory is the most natural way to do this.
Another way to view this (pointed out to me by a friend) is: Butterfly Probability is the probability assigned by a Bayesian God who is omniscient about the current state of the universe up to 10^{-50} precision errors in the positions of atoms.
Well, I guess you can say that, due to chaos, even the best map requires probabilities, which, in a way, makes it a feature of the territory, because it is common to all maps.
Probability is only in the map if it isn’t in the territory as well. The theory that it is in the territory as well is not known to be true, but is scientifically respectable. As Gabriel writes:
Your butterfly formalism strikes me as a good description of what an “objective” probability is (and what ‘frequentists’ actually mean). The problem with the ‘frequentist view’ is best illustrated by your own example:
Saying something is 50% likely because it happens 50% of the time is valid, but it does not actually refer to any real phenomenon. Real coins thrown by real people are not perfectly fair, because angular momentum is crucial, if you let the coin land on a flat surface.
In some sense, nothing is objective, there is only more and less objective. But throwing a die under carefully set up conditions (like in the casino game craps) gets you pretty close to an “objective” probability that multiple humans can agree on.
Let me try to rephrase this in more conventional probability theory. You are looking at a metric space of universes $(U, d)$. You probably want to take the Borel sigma-algebra $\mathcal{B}$ as your collection of events. We think of propositions as sets $A \in \mathcal{B}$, which really just means $A \subset U$ is a subset which is not too irregular. Then the indicator function $\chi_A(u)$ is 1 if $A$ holds in universe $u$ and 0 otherwise.
Your elaborations do not depend much on the time so we set t=0.
You now talk about picking a universe uniformly from a ball $B_\epsilon(u_0) = \{u \in U : d(u, u_0) < \epsilon\}$. This is a problem. On finite-dimensional vector spaces we have the Lebesgue measure and we can have such a uniform distribution. On your metric space of universes it is entirely unclear what this means. You have to actually specify a distribution, and this choice of distribution then influences your outcome to the extreme. It is similar to how you cannot uniformly pick a natural number. What we can do is say the following: we fix a sequence of probability measures $\rho_n$ on $U$ so that $\rho_n$ converges to $\delta_{u_0}$ in the sense of weak convergence of probability measures. What this means is that you choose a sequence of distributions which approximate the Dirac delta at $u_0$, the distribution which samples $u_0$ with probability 1. Then you can say something like: “The butterfly probability decay sequence around $u_0$ with respect to $\rho_n$ is given by $P_{u \sim \rho_n}(u \in A)$.”
Here I am also not formalizing your sense of “convergence in the middle” because this is extremely unlikely to correspond to something rigorous. You can view the above as a sequence in $n$ and then study its decay as $n$ goes to infinity, which corresponds to $\epsilon$ going to zero.
But everything here will depend on your choice of $\rho_n$. You cannot necessarily choose uniformly from a small neighbourhood in any metric space. If the metric space is an infinite-dimensional vector space, choosing uniformly is not possible.
There may be an alternative which means you don’t have to choose the $\rho_n$. You can fix a metric between probability measures which metrizes weak convergence, for example the Wasserstein distance $W$. Then you could perhaps look at: $\sup_{\rho \,:\, W(\rho, \delta_{u_0}) < \epsilon} P_{u \sim \rho}(u \in A)$.
This may be infinite or zero though.
I’m not quite sure what the point of all of this is… You’ve decided you want to be able to define what a god’s eyes probability for something would be, and indeed come up with what (at least initially) seems like a reasonable definition. But why should I want to define such a thing in the first place, if, as you yourself admit, it isn’t actually useful for anything?
Bayesianism and frequentism both have their limitations.
I often talk about the “true probability” of something (e.g. AGI by 2040). When asked what I mean, I generally say something like “the probability I would have if I had perfect knowledge and unlimited computation”—but that isn’t quite right, because if I had truly perfect knowledge and unlimited computation I would be able to resolve the probability to either 0 or 1. Perfect knowledge and computation within reason, I guess? But that’s kind of hand-wavey. What I’ve actually been meaning is the butterfly probability, and I’m glad this concept/post now exists for me to reference!
More generally I’d say it’s useful to make intuitive concepts more precise, even if it’s hard to actually use the definition, in the same way that I’m glad logical induction has been formalized despite being intractable. Also I’d say that this is an interesting concept, regardless of whether it’s useful :)
How would you ever know what the butterfly probability of something is, such that it would make sense to refer to it? In what context is it useful?
“My probability is 30%, and I’m 50% sure that the butterfly probability is between 20% and 40%” carries useful information, for example. It tells people how confident I am in my probability.