Averages and sample sizes
My brain seems to not get statistics and probability. So I was wondering how averaging things out is influenced by the number of items being averaged, and in particular, how the underlying distribution changes things. Here are some stories to illustrate the issue:
Minting
You’re a Dutch banker working for the famous Wisselbank. Your job is to check whether the gold coins coming in are good. You know that each coin should weigh 25g ± 0.5g (I’m taking a lot of liberties with this story). Your method is to count out 100 coins, and see if they weigh 2.5kg. You just got a batch that weighs 2480g, which is 20g less than expected. But since each coin is ± 0.5g, and you have 100 coins, that should be ± 50g, right? Or does this suggest that someone is trying to cheat you by making slightly lighter coins, and pocketing the difference in gold?
Network errors
You’re on monitoring duty at work. Previous tests showed that something like 2 requests out of every 1000 caused errors for random reasons (e.g. Russian hackers are probing your servers, or a user clicked too fast) and so can be safely ignored (there’s a task to fix that, but it keeps being left for the next sprint). You just got an alarm in the middle of the night: over the last hour, there were 6840 network errors, at a request rate of 1k/s. This is 0.19% of all the requests, and the known error rate was 0.2%. Should you go back to sleep and have a look at it in the morning, or is this a real issue?
Party!!
You like organizing parties for your friends. Usually, when you invite 100 people, some people don’t show up, while others bring friends, so you assume that there will be between 90 − 110 people. You want to organize a weekly meeting for the next year, but don’t know how much swag to order. Assuming 100 people have signed up, and you have a very good deal lined up to get all the resources for the whole year today (i.e. for all 52 weeks), how much should you order?
Bird watching
You’re an ornithologist who is studying the lesser spotted woodpecker (or any other species). You want to work out how many of them can be supported by a hectare of woodland. You’re going to visit various patches of forest (or maybe send your friends?) and count how many birds can be found during each trip. You know there is a magical way of calculating this, called hierarchical mixed models or something, but the only guy who actually understands them is currently busy with his habilitation and also has a large backlog of other people’s studies, so it should be enough to just take the average, right? If you check a bunch of places, model it as a Poisson distribution, and take the average, that should be fine, hopefully?
Differing distributions
If you have a distribution with a large spread, will that make the resulting averages more diffuse? These stories all involve different distributions (respectively: Normal, Binomial, Discrete Uniform and Poisson), so my initial assumption was that they would behave differently when averaged. I guessed that normal distributions should lend themselves well to averaging, but that the others might need some extra mangling or something.
I’d read somewhere long ago (in a very interesting article that I unfortunately can’t find) that the mint story was a real thing at some bank or other, where they’d use this method to check coins. I vaguely recalled that the moral was something like “the larger the sample size, the smaller the expected variance”, but this would have to be checked.
If I had known enough statistics, I’d have realised that this was just a question about the central limit theorem. So seeing as I’m a much better programmer than I am a statistician, I did it the hard way, and started out by writing a simulator to test out various situations.
This is where things started getting tricky. I had assumed that everything is normal, and was trying to model the network errors story as normally distributed. It’s quite obviously (in retrospect) not. Which leads to one of my insights: always make sure you know what distribution to use for modeling things. This is something I knew. I’d even explained to people that this is important. And yet.
Central limit theorem
The central limit theorem states that if you have independent and identically distributed random variables, then the sampling distribution of a sample mean is approximately normal, if the sample size is large enough. Notice that this doesn’t say anything about the underlying distribution, only about the sample mean—if you take the average of a (large enough) sample from any (with the appropriate caveats) distribution, you get a bell curve. This is huge! And pretty obvious, as it pops up everywhere… It’s basically the idea behind replicating studies, meta-studies etc. - if you take 100 nutrition studies about whether tomatoes cause cancer and average them out, you’ll get something near the real value, even if the original studies were all over the place. With the appropriate caveats, of course.
This much I knew. Maybe not explicitly, but it’s hard to spend much time around Science™ and not realise that taking the average of things is a powerful technique. Especially when you find out how the law of large numbers works (of which the CLT is an extension). The really interesting thing, though, is how its power scales with the sample size.
The CLT says that $\bar{X}_n$ (i.e. the sample mean) has a distribution that is approximately Normal with mean $\mu$ and variance $\frac{\sigma^2}{n}$, where $\sigma^2$ is the variance of the underlying distribution. This means that the standard deviation is $\frac{\sigma}{\sqrt{n}}$. Notice the square root there - this is crucial. The standard deviation of a sample mean shrinks with the square root of the sample size. Which implies quickly diminishing returns, what with this being nonlinear. The moral here being that bigger might well be better, but not by all that much - you get a lot more juice going from sample sizes of 10 → 100 items, than from going 100 → 1000. Reducing the standard deviation by a factor of $k$ requires $k^2$ times more data points.
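To make the $\sqrt{n}$ scaling concrete, here is a minimal sketch of such a simulation (assuming Python with numpy; not necessarily how the original simulator was written). It averages batches drawn from a deliberately non-normal distribution and watches how the spread of the batch means shrinks:

```python
import numpy as np

rng = np.random.default_rng(42)

# The underlying distribution is deliberately non-normal (exponential, std dev 1).
# The CLT only makes a claim about the distribution of the *batch means*.
for n in [10, 100, 1000, 10000]:
    # 1000 batches of n samples each; one mean per batch.
    means = rng.exponential(scale=1.0, size=(1000, n)).mean(axis=1)
    print(f"n={n:>6}  std dev of batch means: {means.std():.4f}  (theory: {1 / np.sqrt(n):.4f})")
```

Going from 10 to 100 samples shrinks the spread by roughly a factor of 3, and each further tenfold increase buys the same modest improvement for ten times the work.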
CLT in practice
How does this apply to the examples from the beginning? Let’s give them a whirl:
Minting
This is a case where the underlying random variable (the weight of a coin) can be modeled as a Normal distribution with a mean of 25g and a variance of 0.5 (this treats the ±0.5g as a fairly loose spread - to make it more of a hard limit, something like a variance of 0.05 would have to be used).
Running 1000 simulations of batches of 100 items gives a std dev of 0.07. Which makes sense, as $\frac{\sigma}{\sqrt{n}} = \frac{\sqrt{0.5}}{\sqrt{100}} \approx 0.07$. This is reflected in the graph, where the average weights are between 24.8 − 25.25g. Using batches 10 times larger only gives ~3 times smaller std devs ($\sqrt{10} \approx 3.2$), which is good, but would involve a lot more work.
This should totally suffice to catch your non-sophisticated cheater! At least the type who makes each coin weigh e.g. 24.7g in the hope that you won’t notice. What if they just shave a bit off every 10th coin, or mint coins that are on average 24.5g but toss in a heavier one every now and then? I haven’t gone deep into this, but it seems like the check will still work - the resulting distribution will also have a mean and standard deviation, which will in most cases be different from the expected ones.
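For reference, here is a sketch of what such a simulation could look like (again Python/numpy assumed, using the Normal(25g, variance 0.5) model above; the 24.7g cheater is just the illustrative case from the previous paragraph):

```python
import numpy as np

rng = np.random.default_rng(0)
n_coins, n_batches = 100, 1000
sigma = np.sqrt(0.5)  # variance 0.5 -> std dev of ~0.71g per coin

# Honest batches: the average weight of 100 coins, repeated 1000 times.
honest_means = rng.normal(25.0, sigma, size=(n_batches, n_coins)).mean(axis=1)
print("std dev of batch averages:", honest_means.std())  # ~0.07

# A cheater shaving 0.3g off every coin: the batch average lands over 4 std devs low.
cheater_mean = rng.normal(24.7, sigma, size=n_coins).mean()
z = (cheater_mean - 25.0) / (sigma / np.sqrt(n_coins))
print("cheater batch average:", cheater_mean, "z-score:", z)
```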
Network errors
Network errors (assuming they’re independent - heh!) can be modeled as a Binomial distribution, i.e. $n$ trials, each of which has probability $p$ of succeeding. In this case, with 1k requests/s and a 0.2% error rate, that’s Binomial(1000, 0.002) per second. Let’s assume that metrics are collected per minute, which gives us batches of 60s * 1k requests, i.e. 60,000 trials per batch. There are 60 minutes in each hour, so in total that’s ~3,600,000 requests. Out of those, 0.2% should be errors, so we expect to get 7200 errors, or 3,592,800 successful requests. In the example, there were 6840 errors. Is that ok?
Running 1000 simulations gives on average around 3,592,775 successes, which is almost ideal according to the calculations (60k is a massive sample, after all). That being said, 6840 errors were pretty much never generated in any of the 1000 trials - there were always more. So even though 0.19% seems pretty much the same as 0.2%, with such a large sample size the difference is a big deal. Does this also check out in theory? The experimental standard deviation of the hourly error counts was 86.7. This means that around 68% of the hourly error counts should be 7200 ± 87, and around 95% should be 7200 ± 174. This is a very narrow band. Which is what you get from such a large sample size. It also checks out if one does the explicit calculations. The variance of a Binomial distribution is $np(1-p)$, where $n$ is the number of trials and $p$ the probability of success. So in this case $\sigma^2 = 60000 \cdot 0.002 \cdot 0.998 \approx 119.76$ and $\sigma \approx 10.9$. The standard deviation of the mean over 60 batches should then be $\frac{10.9}{\sqrt{60}} \approx 1.41$. Multiplying this by 60 (the number of batches) transforms it from the std dev of the averages to the std dev of the counts, which checks out, as $1.41 \cdot 60 \approx 85$, which is close enough to the experimental result.
There were 6840 errors in the example, which is 360 fewer than expected, and more than 4 standard deviations away from the mean implied by the 0.2% error rate. This means that something fishy is going on, but in a positive way? As there are fewer errors than expected? Interesting… It can probably wait till tomorrow, though. This is fine…
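A sketch of how one could simulate and check this (Python/numpy assumed; the numbers mirror the ones above, though this isn’t necessarily the exact code behind them):

```python
import numpy as np

rng = np.random.default_rng(1)
requests_per_minute = 60_000   # 60s * 1k requests/s
p_error = 0.002
n_hours = 1000

# Simulate 1000 hours, each as 60 per-minute Binomial(60_000, 0.002) draws.
per_minute = rng.binomial(requests_per_minute, p_error, size=(n_hours, 60))
hourly_errors = per_minute.sum(axis=1)

print("mean hourly errors:", hourly_errors.mean())        # ~7200
print("std dev of hourly errors:", hourly_errors.std())   # ~85
print("hours with <= 6840 errors:", (hourly_errors <= 6840).sum())  # usually 0

# How surprising is the 6840 from the alert, in theory?
sigma = np.sqrt(3_600_000 * p_error * (1 - p_error))      # ~84.8
print("z-score of 6840:", (6840 - 7200) / sigma)          # ~ -4.2
```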
Party
This one will assume a (discrete) uniform distribution, i.e. any value between 90-110 is equally likely. A Poisson or Binomial would probably be more realistic here, but it doesn’t really matter. In this case the batch size is 52, as the idea is to have a meeting each week.
1000 simulations resulted in an average of 100 (it would be strange if it wasn’t…) and a standard deviation of 0.84. Does this make sense in theory? The variance of a discrete uniform distribution is $\frac{(b - a + 1)^2 - 1}{12}$, where $a$ and $b$ are the min and max values. So in this case, $\frac{(110 - 90 + 1)^2 - 1}{12} \approx 36.7$, i.e. $\sigma \approx 6.06$. The expected std deviation for 52 meetings should then be $\frac{6.06}{\sqrt{52}} \approx 0.84$. Which is exactly what the simulation generated.
Since 3 standard deviations contain 99.73% of the values (via the 68-95-99.7 rule), it seems that getting enough stuff for $52 \cdot (100 \pm 3)$ people (rounding $3 \cdot 0.84 \approx 2.5$ up to 3 per week), i.e. enough for between 5044 and 5356 people, should be totally fine. This is a lot better than just getting enough for between $52 \cdot 90 = 4680$ and $52 \cdot 110 = 5720$ people - a span of 1040 vs 312, which is more than 3 times larger! Getting enough for 5300 people seems like a decent enough amount.
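Here’s a possible version of the party simulation (Python/numpy assumed), which also checks how many simulated years fall outside the 5044–5356 band:

```python
import numpy as np

rng = np.random.default_rng(2)
n_years, n_weeks = 1000, 52

# Attendance each week: discrete uniform on 90..110 (high=111 is exclusive).
weekly = rng.integers(90, 111, size=(n_years, n_weeks))

avg = weekly.mean(axis=1)
print("std dev of weekly averages:", avg.std())  # ~0.84

totals = weekly.sum(axis=1)
print("yearly totals, min/max:", totals.min(), totals.max())

# Budgeting 52 * (100 ± 3) = 5044..5356 covers essentially every simulated year.
print("years outside 5044..5356:", ((totals < 5044) | (totals > 5356)).sum())
```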
Bird watching
Here it’s assumed that $\lambda = 3$, i.e. that there were more or less 3 woodpeckers found in each place. To make this more realistic, let’s assume that you have counts from 10 previous trips, which are [4, 5, 1, 6, 3, 4, 2, 1, 1, 0]. You ask ChatGPT to estimate the $\lambda$ parameter for you, as one does, and it replies that:
The lambda parameter of a Poisson distribution is typically estimated as the mean of the data. To calculate this for the provided data [4, 5, 1, 6, 3, 4, 2, 1, 1, 0], you would add all the numbers together and then divide by the count of the numbers.
Sum of the values: 4 + 5 + 1 + 6 + 3 + 4 + 2 + 1 + 1 + 0 = 27
Count of the values: 10
Mean of the values: 27 / 10 = 2.7
So, the estimated lambda parameter for this data is 2.7.
So far, so good. But 10 checks doesn’t seem like that much. How many times should you go counting for this to be reasonably certain? 10 counts gives a standard deviation of the sample mean of around 0.54. How many would be needed to get 3 std deviations down to around 1 (so that ~99% of the time the estimated lambda is within 1 of the real one)?
The variance of a Poisson distribution is just its $\lambda$ parameter, which is probably going to be around 3. It would be nice to have a sample size large enough that the sample mean is within about 1 of the real value 99% of the time (assuming that all woodland patches are and will be identical, including weather, time of day etc. - isn’t this fun!). So what you want is to find $n$ such that $\frac{\sigma}{\sqrt{n}} = \epsilon$, where $\epsilon = \frac{1}{3}$ (as 99% is 3 std deviations) and $\sigma = \sqrt{3}$ (the variance of a Poisson distribution is $\lambda$, so it has to be square rooted here). A few transformations later, you have $n = \left(\frac{\sigma}{\epsilon}\right)^2 = 3 \cdot 9 = 27$.
Does that check out? 1000 simulations of batches of 27 items from a Poisson distribution with $\lambda = 3$ results in a std deviation of 0.33. Which is exactly what was desired! So you can now go and check another 17 patches to get your missing data, and then finally publish yet another paper! Once you work out how to make a p-value for this...
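And a sketch of the whole bird-watching calculation in code (Python/numpy assumed): estimate lambda from the 10 counts, solve for the required number of trips, and sanity-check it by simulation.

```python
import numpy as np

rng = np.random.default_rng(3)

counts = np.array([4, 5, 1, 6, 3, 4, 2, 1, 1, 0])
lam_hat = counts.mean()
print("estimated lambda:", lam_hat)  # 2.7

# Sample size so that 3 standard deviations of the sample mean span ~1,
# i.e. sigma / sqrt(n) = 1/3 with sigma = sqrt(lambda) ~ sqrt(3).
lam = 3.0
n_needed = int(np.ceil(lam / (1 / 3) ** 2))
print("trips needed:", n_needed)  # 27

# Sanity check: 1000 simulated studies of 27 trips each.
means = rng.poisson(lam, size=(1000, n_needed)).mean(axis=1)
print("std dev of the estimated lambdas:", means.std())  # ~0.33
```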
Lessons learnt
always think about the underlying distributions when modeling things
taking an average of a batch of measurements really works and will reflect the underlying distribution of the measured thing
the averages of samples from most distributions have a normal distribution → probability statements about the averages of batches of measurements can be approximated with a normal distribution (but not necessarily the underlying variables!)
the marginal value of collecting more data quickly diminishes - the precision of the average only improves with $\sqrt{n}$. This would explain why so many examples, lessons, experiments etc. use a sample count of 30 - that often really will suffice
as a rule of thumb, a sample size of 100 is only twice as good as 30, while 10k is only 10 times better than 100
Note that in adversarial (or potentially adversarial) situations, errors are not independent and identically distributed. If your acceptance spec for gold coins is “25g ± 0.5g”, you should expect your suppliers to mostly give you coins near 24.5g. Network errors are also correlated, either because they ARE an attack, or because some specific component or configuration is causing them.
I also wanted to test a log-normal distribution, but couldn’t include pictures here, so I made it a separate, brief post.
Spoiler alert: It works
https://www.lesswrong.com/posts/GkEW4vH6M6pMdowdN/averaging-samples-from-a-population-with-log-normal