The central limit theorem in terms of convolutions
There are multiple versions of the central limit theorem. One common version says that, as n→∞, the distribution of the sample mean looks like a Gaussian distribution. People in industry sometimes ignore the “goes to infinity” part, and assume that if n is large, that’s good enough. But is that true? When is it safe to assume that your means have converged? I heard of a case recently where the distribution of means was assumed to have converged, hadn’t, and the application it was a part of suffered accuracy problems for months without anyone knowing. This sequence will explore: under what conditions does the central limit theorem “fail” like this?
The central limit theorem is about convolutions
There are multiple versions of the central limit theorem. They’re all a version of the statement:

If you have a bunch of distributions $f_i$ (say, n of them), and you convolve them all together into a distribution $F_* := f_1 * f_2 * f_3 * \cdots * f_n$, then the larger n is, the more $F_*$ will resemble a Gaussian distribution.
The simplest version of the central limit theorem requires that the distributions $f_i$ be 1) independent and 2) identically distributed. In this sequence, I’m gonna assume #1 is true. We’ll find that while condition #2 is nice to have, even without it, distributions can converge to a Gaussian under convolution.
A Gaussian distribution is the same thing as a Normal distribution. Some examples of Gaussian distributions:
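They’re all the same bell shape, just shifted and stretched. Here’s a minimal sketch in NumPy (the particular means and standard deviations below are arbitrary picks of mine):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian distribution with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-8, 8, 1601)
dx = x[1] - x[0]
for mu, sigma in [(0, 1), (0, 2), (3, 0.5)]:
    pdf = gaussian_pdf(x, mu, sigma)
    # Changing mu slides the bell; changing sigma stretches it. The area stays 1.
    print(f"mu={mu}, sigma={sigma}: peak height {pdf.max():.3f}, area {np.sum(pdf) * dx:.3f}")
```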
Wait—this doesn’t sound like the central limit theorem I know
Most statements of the central limit theorem, including Wikipedia’s, talk in terms of the sums of random variables (and their density functions and expected values). But this is the same thing as our convolution-of-distributions, because the density function of the sum of two random variables X, Y is the convolution of the density functions of X and Y. Looking at the central limit theorem in terms of convolutions will make it easier to see some things. It’s also useful if your version of probability doesn’t have the concept of a random variable, like probability theory as extended logic*.
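To see that equivalence in action, here’s a small sketch (the two distributions, a Uniform(0, 1) and an Exponential(1), are arbitrary picks of mine): sample X + Y directly, and separately convolve the two density functions on a grid. The histogram of sampled sums should roughly track the convolved density.

```python
import numpy as np

rng = np.random.default_rng(0)

# X ~ Uniform(0, 1) and Y ~ Exponential(1); sample their sum directly...
n_samples = 200_000
samples = rng.uniform(0, 1, n_samples) + rng.exponential(1.0, n_samples)

# ...and separately convolve their density functions on a grid.
dx = 0.001
x = np.arange(0, 10, dx)
f = np.where(x < 1, 1.0, 0.0)            # density of Uniform(0, 1)
g = np.exp(-x)                           # density of Exponential(1)
conv = np.convolve(f, g)[: len(x)] * dx  # approximate density of X + Y on the same grid

# Compare a histogram of the sampled sums against the convolved density.
hist, edges = np.histogram(samples, bins=np.arange(0, 10.1, 0.1), density=True)
centers = (edges[:-1] + edges[1:]) / 2
for c, h in zip(centers[:6], hist[:6]):
    # The two columns should roughly agree (up to sampling and grid error).
    print(f"x={c:.2f}  histogram={h:.3f}  convolution={conv[int(round(c / dx))]:.3f}")
```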
Another statement of the central limit theorem, from Jaynes:

The central limit theorem tells us what is essentially a simple combinatorial fact, that out of all conceivable error vectors $e_1, \ldots, e_n$ that could be generated, the overwhelming majority have about the same degree of cancellation.
Hopefully this is enough to convince you that the central limit theorem need not be stated in terms of random variables or samples from a population.
(Also: the central limit theorems are sometimes talked about in terms of means only. They do say things about means, but they also say very much about distributions. I’ll talk about both.)
*: first three chapters here.
Convolutions
The central limit theorem is a statement about the result of a sequence of convolutions. So to understand the central limit theorem, it’s really important to know what convolutions are and to develop a good intuition for them.
Take two functions, f and g. Flip g about the y-axis, then slide it and f along each other, multiplying the two functions pointwise and adding up (integrating) the products at each position as you slide. The result is the convolution of f and g. Here’s a picture: f is blue, g is red, and the convolution f∗g is black:
Another one, with a different function f:
You can write this as $f*g = (f*g)(x) = \int_{-\infty}^{\infty} f(y)\,g(x-y)\,dy$.
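Here is a sketch of that formula evaluated numerically on a grid; the two functions (a box and a Gaussian) are arbitrary picks of mine, and the result is checked against NumPy’s built-in convolve:

```python
import numpy as np

dx = 0.01
x = np.arange(-5, 5, dx)                      # a shared grid for both functions
f = np.where(np.abs(x) < 0.5, 1.0, 0.0)       # a box of width 1
g = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)  # a standard Gaussian density

def convolve_by_definition(f, g, x, dx):
    """For each output point x_i: flip g, shift it to x_i, multiply by f, integrate."""
    out = np.empty_like(x)
    for i, xi in enumerate(x):
        g_flipped_shifted = np.interp(xi - x, x, g, left=0.0, right=0.0)  # g(x_i - y)
        out[i] = np.sum(f * g_flipped_shifted) * dx
    return out

direct = convolve_by_definition(f, g, x, dx)

# np.convolve computes the same sums; the slice lines its "full" output up with the
# grid x (which works here because x is centered on 0), and dx turns sums into integrals.
fast = np.convolve(f, g, mode="full")[len(x) // 2 : len(x) // 2 + len(x)] * dx
print(np.max(np.abs(direct - fast)))          # prints a very small number: they agree
```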
The first two sections on this page give a nice explanation of convolutions in terms of dropping a ball twice. This page lets you choose two functions to convolve visually, and I highly recommend it for getting a feel for the operation.
Convolution has nice properties
Convolution is associative and commutative—if you want to convolve multiple functions together, it doesn’t matter what order you do it in. (Also, nothing I’ve said in this section requires f and g to be probability distributions—they can be any two functions.)
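A quick check of both properties on three arbitrary made-up arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = rng.random(5), rng.random(6), rng.random(7)     # any three functions, as arrays

left = np.convolve(np.convolve(a, b), c)    # (a * b) * c
right = np.convolve(a, np.convolve(b, c))   # a * (b * c)
print(np.allclose(left, right))             # True: associativity
print(np.allclose(np.convolve(a, b), np.convolve(b, a)))  # True: commutativity
```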
You can get the mean of a convolution without actually doing the convolutions. If the first moments of f and g are $F_1$ and $G_1$ (for probability distributions, the first moment is the mean), then the first moment of f∗g is $F_1 + G_1$. There is a hint here about why people often talk about means when they talk about the central limit theorem—being able to get the mean of the convolution by just adding the means of the underlying distributions together is really nice.
You can get the second moment without actually convolving, too. (The second moment is closely related to the variance: if the mean is 0, the second moment and the variance are the same.) For f and g with second moments $F_2$, $G_2$, the second moment of f∗g is $F_2 + 2F_1G_1 + G_2$. And so on for higher moments, for as long as f and g actually have those higher moments themselves.
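Here’s a quick check of both moment facts, using two small made-up discrete distributions:

```python
import numpy as np

# Two made-up probability distributions, living on the points {0, 1, 2, ...}.
f = np.array([0.1, 0.2, 0.3, 0.4])
g = np.array([0.5, 0.25, 0.25])

def moment(p, k):
    """k-th raw moment of a distribution p on the points 0, 1, 2, ..."""
    return np.sum(np.arange(len(p)) ** k * p)

conv = np.convolve(f, g)                    # f * g, a distribution on {0, ..., 5}

F1, G1 = moment(f, 1), moment(g, 1)
F2, G2 = moment(f, 2), moment(g, 2)
print(moment(conv, 1), F1 + G1)             # first moments:  both 2.75 (up to float rounding)
print(moment(conv, 2), F2 + 2*F1*G1 + G2)   # second moments: both 9.25 (up to float rounding)
```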
If $\mathcal{F}\{f\}$ and $\mathcal{F}\{g\}$ are the Fourier transforms of f and g, then $\mathcal{F}\{f*g\} = \mathcal{F}\{f\}\,\mathcal{F}\{g\}$. This is very useful for computing convolutions numerically: evaluating $\int_{-\infty}^{\infty} f(y)\,g(x-y)\,dy$ naively would require doing an integral for every single x value you were interested in. That’s expensive, time-wise. Much easier to compute $\mathcal{F}\{f\}$ and $\mathcal{F}\{g\}$ and then just multiply them (multiplication is cheap). After that, applying the inverse Fourier transform gives f∗g. This is how convolution is computed in popular numerical programming languages.
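Here’s a sketch of that recipe using NumPy’s FFT. The two functions are again arbitrary picks of mine; the zero-padding makes the product of transforms correspond to ordinary rather than circular convolution:

```python
import numpy as np

dx = 0.01
x = np.arange(-8, 8, dx)
f = np.where(np.abs(x) < 0.5, 1.0, 0.0)       # a box
g = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)  # a Gaussian

# Pad to at least len(f) + len(g) - 1 samples so that multiplying the transforms
# corresponds to ordinary (not circular) convolution.
n = len(f) + len(g) - 1
conv_fft = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n) * dx

conv_direct = np.convolve(f, g) * dx           # the slow way, for comparison
print(np.max(np.abs(conv_fft - conv_direct)))  # prints a very small number: same result
```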
If f sums to 1 and g sums to 1, then f∗g sums to 1. So if f and g are probability distributions, so is f∗g.
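A tiny check, with two made-up discrete distributions:

```python
import numpy as np

f = np.array([0.25, 0.5, 0.25])           # a probability distribution (sums to 1)
g = np.array([0.125, 0.375, 0.375, 0.125])  # another probability distribution
fg = np.convolve(f, g)                    # their convolution
print(f.sum(), g.sum(), fg.sum())         # all print 1.0
```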
In the next post, I’ll look at what happens when you convolve a bunch of different distributions together, and explore how much of what happens depends on the form of those distributions.
Postscript: not neural networks
You may have heard of convolutional neural networks from all their success in image processing. I think they involve the same convolution I’m talking about here, but in much higher dimensions. I’ll only be talking about convolutions of one-dimensional functions in this sequence. Some (or a lot?) of the stuff I’ll say about the central limit theorem probably applies in higher dimensions too, but I’m not sure what changes as the dimension increases. So, this sequence is not about neural networks.