The Fallacy of Large Numbers
I’ve been seeing this a lot lately, and I don’t think it’s been written about here before.
Let’s start with a motivating example. Suppose you have a fleet of 100 cars (or horses, or people, or whatever). For any given car, on any given day, there’s a 3% chance that it’ll be out for repairs (or sick, or attending grandmothers’ funerals, or whatever). For simplicity’s sake, assume all failures are uncorrelated. How many cars can you afford to offer to customers each day? Take a moment to think of a number.
Well, 3% failure means 97% success. So we expect 97 to be available and can afford to offer 97. Does that sound good? Take a moment to answer.
Well, maybe not so good. Sometimes we’ll get unlucky. And not being able to deliver on a contract is painful. Maybe we should reserve 4 and only offer 96. Or maybe we’ll play it very safe and reserve twice the needed number. 6 in reserve, 94 for customers. But is that overkill? Take note of what you’re thinking now.
The likelihood of having more than 4 unavailable is 18%. The likelihood of having more than 6 unavailable is 3.1%. About once a month. Even reserving 8, requiring 9 failures to get you in trouble, gets you in trouble 0.3% of the time. More than once a year. Reserving 9 (three times the expected number) gets the risk down to 0.087%, or a little less than once every three years. A number we can finally feel safe with.
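(If you want to check those numbers yourself, here’s a quick sketch in Python, assuming the failures really are independent so the daily count of unavailable cars is Binomial(100, 0.03).)

```python
from math import comb

n, p = 100, 0.03  # 100 cars, 3% chance each is out on any given day

def prob_more_than(k):
    """P(more than k cars unavailable), assuming independent failures (binomial model)."""
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

for reserve in (4, 6, 8, 9):
    print(reserve, prob_more_than(reserve))
# reserve 4 -> ~0.18, reserve 6 -> ~0.031, reserve 8 -> ~0.0032, reserve 9 -> ~0.00086
```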
So much for expected values. What happened to the Law of Large Numbers? Short answer: 100 isn’t large.
The Law of Large Numbers states that for sufficiently large samples, the results look like the expected value (for any reasonable definition of like).
The Fallacy of Large Numbers states that your numbers are sufficiently large.
This doesn’t just apply to expected values. It also applies to looking at a noisy signal and handwaving that the noise will average away with repeated measurements. Before you can say something like that, you need to look at how many measurements you have and how much noise there is, and crank out a lot of calculations. This variant is particularly tricky because you often don’t have numbers on how much noise there is, making it hard to do the calculation. When the calculation is hard, the handwave is more tempting. That doesn’t make it more accurate.
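To give a flavor of what the calculation looks like in the simplest case: the noise in an average of n independent measurements shrinks like sigma divided by the square root of n, so you can at least estimate how many measurements you’d need. The numbers below are made up purely for illustration.

```python
from math import ceil

# Illustrative numbers only: suppose each measurement has noise with standard
# deviation 5 units, and we want the averaged result to be good to about 1 unit.
sigma = 5.0   # per-measurement noise (often the thing you don't actually know)
target = 1.0  # desired precision of the average

# Standard error of the mean is sigma / sqrt(n); solve sigma / sqrt(n) <= target.
n_needed = ceil((sigma / target) ** 2)
print(n_needed)  # 25 -- and that's only a one-standard-error level of confidence
```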
I don’t know of any general tools for saying when statistical approximations become safe. The best thing I know is to spot-check like I did above. Brute-forcing the combinatorics sounds scary, but Wolfram Alpha can be your friend (as above). So can Python, which has native bignum support. Python has a reputation for being slow at number crunching, but with n < 1000 and a modern CPU it usually doesn’t matter.
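For instance, the spot-check above can be redone in exact arithmetic with Python’s built-in big integers and fractions, so floating-point error never enters the picture (a sketch, same binomial model as before):

```python
from fractions import Fraction
from math import comb

n, p = 100, Fraction(3, 100)

# Exact P(at most 9 failures); comb() and Fraction use bignums throughout.
p_at_most_9 = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(10))
print(float(1 - p_at_most_9))  # ~0.00086, the "reserve 9" risk from above
```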
One warning sign is if your tools were developed in a very different context than where you’re using them. Some approximations were invented for dealing with radioactive decay, where n resembles Avogadro’s Number. Applying these tools to the American population is risky. Some were developed for the American population. Applying them to students in your classroom is risky.
Another danger is that your dataset can shrink. If you’ve validated your tools for your entire dataset, and then thrown out some datapoints and divided the rest along several axes, don’t be surprised if some of your data subsets are now too small for your tools.
This fallacy is related to “assuming events are uncorrelated” and “assuming distributions are normal”. It’s a special case of “choosing statistical tools based on how easy they are to use, whether they’re applicable to your use case or not”.
My number, answer and thoughts were “How much do I gain for supplying a car, and how much do I lose for failing to supply an offered car?”.
Yeah, if you’re an airline, the number might be 105.
?
Airlines regularly oversell flights—they might sell 105 tickets on a flight with 100 seats. They do that because people frequently don’t show up for a flight.
Come to think of it, I’m not actually sure who does this. I’ve probably flown 100 times, and I can only think of one occasion where I’ve not taken a flight that I bought a ticket for. I guess I’m not the typical airline customer.
I think it’s business customers who book flights they might not need, because it’s easier to cancel/not show up than to book in a hurry.
A colleague of mine has on more than one occasion booked several different flights for the same journey, to give himself the flexibility to change his arrangements later. It worked out much cheaper than buying a more expensive ticket that allowed for changes.
For that matter, if you’re 90% sure you’re going to take the flight (which seems reasonable, considering there aren’t too many overbooked tickets), you still save money (in expectation) by buying the ticket early, since tickets bought far in advance are cheaper.
So maybe it’s the better calibrated customers who book flights they might not need.
Also: how much of a loss can I afford to take before going bankrupt? The lower my cash reserves are, the more I want to play it safe.
I was hoping to abstract that away by not specifying “how much money”.
I figured that you might have intended to include it, but thought that somebody might still benefit from it being pointed out explicitly.
It is. The problem is that 3 isn’t large. Having 10,000 cars and a 0.03% failure rate would give almost exactly the same probability distribution for the number of cars broken on a given day (namely, the Poisson distribution with lambda = 3). Even for N = 20 and p = 15%, the Poisson distribution would be a decent approximation.
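A quick check of that claim, in the same brute-force style as the post (a sketch; the tail beyond 6 failures is used just as an example cutoff):

```python
from math import comb, exp, factorial

def binom_tail(n, p, k):
    """P(more than k failures) for Binomial(n, p)."""
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def poisson_tail(lam, k):
    """P(more than k failures) for Poisson(lam)."""
    return 1 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k + 1))

k = 6
print(binom_tail(100, 0.03, k))       # ~0.031
print(binom_tail(10_000, 0.0003, k))  # ~0.033
print(poisson_tail(3, k))             # ~0.033
```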
If you have 100 cars and a 50% failure rate, the standard deviation is five, so you’d need an extra 15 or so cars in reserve to be safe. You have to give up almost a third of the cars you’d expect to have available. Three not being large is the bigger problem, but 100 still isn’t all that large.
No, it does not.
Note that this talks about “performing the same experiment a large number of times”, which guarantees independence in absence of memory effects. It also talks solely about the sample average, nothing else.
What you probably mean is the central limit theorem, which assumes independent, identically distributed random variables.
I’m not sure what the distinction is. A sample of size n is the same thing as n trials (though not necessarily n independent trials).
Talking about substituting sample size n*p for p trials of size n makes a lot more sense in a coin-flipping context than it does in, say, epidemiology. If I’m doing a cohort study, I need a group of people to all start being tracked at one timepoint (since that’s how I’m going to try to limit confounding).
Although it’s theoretically possible to keep adding additional people to track, the data get awfully messy. I’d rather not add them in bit by bit. I’d prefer two cohort studies with a sample size of n each to one study that kept adding to the panel and ended up with 3n people.
Right, but the mathematical meaning of the word “trial” is a little more general, in the sense that even if you pick the sample all at once, you can consider each member of the sample a “trial”.
The usual statistical test is comparing the standard deviation to your measurement precision.
In the car example above, you have a nice binomial distribution, which has a standard deviation of sqrt(N*p*(1-p)).
This is about sqrt(3), which is larger than your measurement precision and gives you a good idea of what the noise looks like.
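Plugging in the numbers from the post, just to make the figure concrete:

```python
from math import sqrt

n, p = 100, 0.03
print(sqrt(n * p * (1 - p)))  # ~1.71, i.e. roughly sqrt(3)
```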
How exactly do I apply that test?
If you want to optimize for some outcome (renting cars at a known average price, with some known average penalty for promising cars you don’t have), you can just directly optimize for it.
But if you just want to get a picture of what’s going on, there aren’t going to be any non-arbitrary tests. Comparing the standard deviation to some scale of interest is just a useful piece of information people use to understand the problem. Feel free to set any arbitrary boundaries (or less arbitrary but still not optimal, e.g. “six sigma” business practices) you want.
I don’t have the book handy, but The Quants (about people who tried applying advanced math to the stock market) mentions that Thorp (the inventor of card counting) did some work on what percentage of your wealth you can bet safely, and that this was ignored by the younger generation of quants.
That’s the Kelly criterion, equivalent to having logarithmic utility for money.
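For reference, the Kelly fraction for the simplest case (win b times your stake with probability p, lose the stake otherwise) is a standard textbook formula; a minimal sketch:

```python
def kelly_fraction(p, b):
    """Fraction of bankroll to stake on a bet that pays b-to-1 and wins with probability p."""
    return p - (1 - p) / b

print(kelly_fraction(0.55, 1.0))  # even-money bet won 55% of the time: stake 10% of bankroll
```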
I think the following lecture notes, which I used at SPARC, would be helpful for these sorts of analyses. In particular, the opening section on moment generating functions and the section on the Poisson distribution (although the examples in that section got nuked because the dataset I used was proprietary to Dropbox and I haven’t yet asked for permission to use it beyond SPARC). Apologies for any roughness (if you find typos, please let me know).
Wolfram|Alpha is rarely anyone’s friend.
Really? I’ve found it an incredibly useful tool for quick checks on whatever-we-were-talking-about.
It’s unclear what the issue is here. The LLN doesn’t say how fast the sample average converges to the mean; it just says that it does.
For the rest, I just see a lot of rambling. I’m reminded of the infamous Teen Talk Barbie.
Fortunately, the LLN isn’t just some black box. It has a proof! And we can look at that proof and get bounds on how quickly the average converges to the mean (which is basically Chebyshev’s inequality, but whatever).
In cases with slightly more regularity to them, we can use Hoeffding’s inequality or something similar and get even better bounds. In fact, this will give results that are almost as good as the assume-it’s-normal strategy, but with the added benefit that you’re actually answering the question you started with, rather than making something up.
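To make the Chebyshev route concrete for the 100-car example (a sketch; the bound is crude, but it’s an actual guarantee rather than a normality handwave):

```python
from math import comb

n, p, reserve = 100, 0.03, 9
mu, var = n * p, n * p * (1 - p)

# Chebyshev: P(|X - mu| >= t) <= var / t^2, so P(X > reserve) is at most this:
t = reserve + 1 - mu
print(var / t**2)  # ~0.059: loose, but requires nothing beyond a finite variance

# Exact binomial tail, for comparison:
exact = 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(reserve + 1))
print(exact)       # ~0.00086
```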
Yeah, you can get bounds, and they are what they are.
But I grope in vain for a point to this article. The LLN doesn’t converge as fast as he’d like? Yeah, and sometimes gravity is inconvenient for me, but I don’t post my disgruntlement about it to the list. Somehow his expectations of the rate of convergence have been violated. I suggest he do the calculations you suggest, and educate his expectations.
Excepting those lurking in his own expectations, where’s the fallacy? What is he talking about?
I think you’re being unnecessarily mean.
I do tend toward leaner and meaner over kinder and gentler. I’m trying to be nicer—I actually toned down the second post, and deleted some stuff. Guess not enough for your taste.
But really, do you know what the point is?
I see it as a special case of “the fallacy of everything related to high school statistics”.
(Okay, so I don’t really agree with the answers the post gives. But I think it’s bringing up an interesting point, and hey, this is only discussion. Possibly if we had lots of high-quality math posts, I would feel differently.)
My personal feeling is that where statisticians go wrong is that they think of their problems, not as something you solve, but something you use tools on. But I’m not sure I can articulate this feeling more precisely than that.