(Note: I use the American radix point, except in quotes, where I preserve loldrup’s.)
That beta distribution will have more built in uncertainty if based on a sample size of 100 rather than a sample size of 1.000.000, but that’s the only difference (right?).
Remember that the posterior is the combination of the prior and the likelihood, weighted by the precision of each. The beta(1,1) prior (the famous ‘uniform’ prior) has a mean of 50%, so it starts us at the estimate that half of the material a machine outputs is going to be defective. If the true rate is 5%, and we somehow get the mode sample each time (5 defective out of 100, or 50,000 out of 1,000,000), the posterior will be closer to the truth in the (50,001, 950,001) case than in the (6,96) case. If we had the prior belief that, say, 2.4% of the material a machine outputs is defective, and decided our belief was strong enough to justify a (24,976) prior (which has a much higher precision than the (1,1) distribution), you’ll notice that 1M datapoints do much more to correct our faulty prior than 100 datapoints. (In the case where we get a perverse sample, of course, the stronger prior is more resistant.)
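To make those numbers concrete, here is a rough sketch of the arithmetic. The function names are my own, and it assumes the idealized "mode sample" where exactly 5% of the inspected items are defective.

```python
def posterior(prior_a, prior_b, defective, good):
    """Beta posterior parameters after seeing the given counts."""
    return prior_a + defective, prior_b + good


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


TRUE_RATE = 0.05  # assumed true defect rate, as in the text

for prior in [(1, 1), (24, 976)]:
    for n in (100, 1_000_000):
        defective = round(TRUE_RATE * n)  # idealized sample, not a random draw
        a, b = posterior(*prior, defective, n - defective)
        print(prior, n, (a, b), round(beta_mean(a, b), 5))
```

With the (24,976) prior, the posterior mean is 29/1100, or about 2.6%, after 100 datapoints, but 50,024/1,001,000, or about 5.0%, after 1M datapoints.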
Would a solution be to make a Bayesian update for each individual observation of faulty/not-faulty products from machine x? Curiously this would seem to move the problem from a mathematical analysis to a brute force computational task (unless all that Bayesian updating can be neatly modelled)
You may be interested in conjugate priors. If I started off with a beta prior (defined by two parameters, alpha and beta), and I observe an event with a Bernoulli likelihood (a product is faulty or not faulty), then I can immediately calculate the posterior distribution by just adjusting the hyperparameters. If my priors are not conjugate to my likelihood, then I have to do a bunch of integrations to get my new posterior, and this is often done by brute force computation.
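A small sketch of why the one-observation-at-a-time approach is not a brute-force problem in the beta/Bernoulli case: each update just increments one hyperparameter, so updating item by item lands on the same posterior as updating once with the totals. The observation list below is made up for illustration.

```python
def update(a, b, is_faulty):
    """One conjugate update of a Beta(a, b) belief after a single Bernoulli observation."""
    return (a + 1, b) if is_faulty else (a, b + 1)


# Hypothetical inspection results: 5 faulty items out of 100.
observations = [False] * 95 + [True] * 5

# Sequential updating, one product at a time, starting from the uniform prior.
a, b = 1, 1
for faulty in observations:
    a, b = update(a, b, faulty)

# Batch updating with the totals.
n_faulty = sum(observations)
batch = (1 + n_faulty, 1 + len(observations) - n_faulty)

print((a, b), batch)
```

The order of the observations does not matter either; only the two counts do.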
I see how this will work for a continuous distribution like the beta distribution. Visually the effect of a high number of samples will be that the curve is more sharply centered on the most probable part of the curve. The outlier cases are more quickly becoming improbable as we move outwards.
But then this must mean that the discrete, “perfect”, “infinite-sample” likelihood distribution used in the Wikipedia example must have a very high influence on the posterior, almost marginalising the effect of the prior. Do I reason correctly here?
And does this “infinite-sample” likelihood distribution really have such a strong effect in the Wikipedia example? (I don’t know how to judge this)
I suspect we should make clear the two points under discussion: first, the rate of defective material that a machine spits out; and second, how much knowing that a piece of material is defective tells us about which machine processed it.
satt’s comment handles the second point; when we are trying to estimate which machine produced a single defective product, the sample size of products is, by necessity, one. (Because we’ve implicitly assumed that the defectivity of products is independent, sampling more of them isn’t really any more interesting than sampling one of them.)
But in order to do that calculation, we need some information about how much defective product each machine produces. As it turns out, we only need the first moment (i.e. mean) of that estimate; higher moments (like the variance) don’t show up in the calculation. (Is it clear to you how to verify that statement?) So a 5% chance that I’m absolutely certain of and a 5% chance that comes from a guess lead to the same final output.
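One way to check that claim numerically: the chance that a single product from a machine is defective is the integral of theta times our density over theta, which is just the mean of our belief about theta. The sketch below integrates a vague Beta(5, 95) and a sharp Beta(5000, 95000), both with mean 5%, using a plain midpoint rule. The machine shares (20%, 30%, 50%) and the other two defect rates (3%, 1%) are my assumption about the numbers in the Wikipedia example, and the helper names are made up.

```python
import math


def expected_rate(a, b, steps=200_000):
    """Midpoint-rule estimate of the integral of theta * Beta(theta; a, b) over (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        # Work in log space so the sharply peaked density does not overflow.
        log_pdf = log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta)
        total += theta * math.exp(log_pdf)
    return total * h


def posterior_a(rate_a):
    """P(machine A | defective), with assumed shares and assumed rates for B and C."""
    joint = {"A": 0.2 * rate_a, "B": 0.3 * 0.03, "C": 0.5 * 0.01}
    return joint["A"] / sum(joint.values())


vague = expected_rate(5, 95)          # a 5% guess based on little data
sharp = expected_rate(5_000, 95_000)  # a 5% estimate based on lots of data

print(vague, sharp)
print(posterior_a(vague), posterior_a(sharp))
```

Both integrals come out at essentially 0.05, so the answer to "which machine made this one defective product?" is the same under either belief.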
And does this “infinite-sample” likelihood distribution really have such a strong effect in the Wikipedia example? (I don’t know how to judge this)
For many probabilistic calculations, it’s helpful to do a sensitivity analysis. That is, we jiggle around the inputs we gave (like the percentage of the total output that each machine produces, or the defectivity rate of each machine, and so on) to determine how strongly they influence the outcome of the procedure. If we were just guessing with the 5% number, but we discover that dropping it to 4% makes a huge difference, then maybe we should go back and refine our estimate to be sure that it’s 5% instead of 4%. If the number is roughly the same, then our estimate is probably good enough.
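Here is a minimal sketch of that kind of jiggling. The shares and defect rates are again my assumption about the Wikipedia example's numbers, and the function name is made up; the only input varied is machine A's rate, from 5% to 4%.

```python
def posterior_by_machine(shares, rates):
    """P(machine | defective) for each machine, by Bayes' theorem."""
    joint = {m: shares[m] * rates[m] for m in shares}
    total = sum(joint.values())
    return {m: joint[m] / total for m in joint}


# Assumed inputs: share of total output and defect rate per machine.
shares = {"A": 0.2, "B": 0.3, "C": 0.5}
base_rates = {"A": 0.05, "B": 0.03, "C": 0.01}

baseline = posterior_by_machine(shares, base_rates)
perturbed = posterior_by_machine(shares, {**base_rates, "A": 0.04})

for m in shares:
    print(m, round(baseline[m], 3), round(perturbed[m], 3))
```

With these assumed numbers, the posterior for machine A moves from about 0.42 to about 0.36, which is enough to make B the most likely source instead of A. That is the sort of result that would send me back to check whether the 5% is really 5%.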
If only the mean of the likelihood distribution is involved, not the variance, then truly the sample size used when creating the likelihood distribution has no influence on the Bayesian update.
Then the next question is: is it a problem?
If I understand you correctly then your answer is: “not really, because ”.
Then it’s only the part I don’t get.
You ask me if it’s clear to me why only the mean of the likelihood distribution is involved in the Bayesian update. Well, it isn’t currently, but I’ll read the article “Continuous Bayes” and see if it then becomes more clear to me:
http://www.sidhantgodiwala.com/blog/2015/03/14/continuous-bayes/