> My expectation of “this coin is biased” did not change
In this particular example, no, it did not. However, if you switch to continuous probabilities (and think not in terms of a binary is-biased/is-not-biased but rather in terms of the probability that the true mean lies outside 0.5 plus-minus epsilon), your estimate of the character of the coin will change.
Also
> “my expectation of the next result of this coin” changed
and
> but I didn’t change my expectation that from the next 1000 flips approximately 500 will be heads.
-- these two statements contradict each other.
Using my simplest example, because it’s simplest to calculate:
Prior:
0.8 fair coin, 0.1 heads-only coin, 0.1 tails-only coin
probability “next is head” = 0.5
probability “next 1000 flips are approximately 500:500” ~ 0.8
Posterior:
0.8 fair coin, 0.2 heads-only coin
probability “next is head” = 0.6 (increased)
probability “next 1000 flips are approximately 500:500” ~ 0.8 (didn’t change)
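The numbers in this example can be checked with a few lines of exact arithmetic. This is only a sketch of the three-hypothesis model described above; the names `p_heads`, `prior`, `update_on_heads`, and `prob_next_heads` are made up for illustration.

```python
from fractions import Fraction as F

# P(heads) for each kind of coin, and the prior weights from the example.
p_heads = {"fair": F(1, 2), "heads-only": F(1), "tails-only": F(0)}
prior = {"fair": F(8, 10), "heads-only": F(1, 10), "tails-only": F(1, 10)}

def update_on_heads(belief):
    """Bayes update after observing a single heads."""
    unnorm = {h: belief[h] * p_heads[h] for h in belief}
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}

def prob_next_heads(belief):
    """Predictive probability that the next flip lands heads."""
    return sum(belief[h] * p_heads[h] for h in belief)

posterior = update_on_heads(prior)
print("P(next is heads):", prob_next_heads(prior), "->", prob_next_heads(posterior))
print("P(coin is fair): ", prior["fair"], "->", posterior["fair"])
```

The probability of heads on the next flip moves from 1/2 to 3/5, while the weight on "fair" stays at 4/5.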
Um.
Probability of a head = 0.5 necessarily means that the expected number of heads in 1000 tosses is 500.
Probability of a head = 0.6 necessarily means that the expected number of heads in 1000 tosses is 600.
Are you playing with two different meanings of the word “expected” here?
If I roll a 6-sided die, the expected value is 3½.
But I don’t really expect to see 3½ as an outcome of the roll. I expect to see either 1, or 2, or 3, or 4, or 5, or 6. But certainly not 3½.
If my model says that 0.2 of the coins are heads-only and 0.8 are fair, then in 1000 flips I expect to see either 1000 heads (probability 0.2) or about 500 heads (probability 0.8). But I don’t expect to see about 600 heads. Yet the expected value of the number of heads in 1000 flips is 600.
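To make the gap between "expected value" and "what I expect to see" concrete, here is a rough sketch that builds the exact distribution of the head count under the 0.2 heads-only / 0.8 fair mixture. The window widths (450 to 550, 550 to 650) are arbitrary choices for illustration, not something from the thread.

```python
from fractions import Fraction as F
from math import comb

n = 1000
w_heads_only, w_fair = F(2, 10), F(8, 10)

def fair_pmf(k):
    """P(k heads in n flips) for a fair coin."""
    return F(comb(n, k), 2 ** n)

def mixture_pmf(k):
    """P(k heads in n flips) when the coin type itself is uncertain."""
    heads_only_part = w_heads_only if k == n else F(0)
    return heads_only_part + w_fair * fair_pmf(k)

expected_heads = sum(k * mixture_pmf(k) for k in range(n + 1))
near_500 = sum(mixture_pmf(k) for k in range(450, 551))
near_600 = sum(mixture_pmf(k) for k in range(550, 651))

print("expected number of heads:", expected_heads)
print("P(450..550 heads):", float(near_500))
print("P(550..650 heads):", float(near_600))
print("P(exactly 1000 heads):", float(mixture_pmf(n)))
```

The mean is exactly 600, yet almost all of the probability sits near 500 or at 1000, and very little sits near 600.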
No, I’m just using the word in the statistical-standard sense of “expected value”.
Lumifer was using the word “expected” correctly.
You can only multiply out P(next result is heads) * (number of tosses) to get the expected number of heads if you believe those tosses are independent trials. The case of a biased coin toss explicitly violates this assumption.
But the tosses are independent trials, even for the biased coin. I think you mean the P(heads) is not 0.6, it’s either 0.5 or 1, you just don’t know which one it is.
Which means that P(heads on toss after next|heads on next toss) != P(heads on toss after next|tails on next toss). Independence of A and B means that P(A|B) = P(A).
As long as you’re using the same coin, P(heads on toss after next|heads on next toss) == P(heads on toss after next|tails on next toss).
You’re confusing the probability of coin toss outcome with your knowledge about it.
Consider an RNG which generates independent samples from a normal distribution centered on some value mu that is unknown to you. As you see more samples you get a better idea of what mu is, and your expectations about what numbers you are going to see next change. But these samples do not become dependent just because your knowledge of mu changes.
Please actually do your math here.
We have a coin that is heads-only with probability 20%, and fair with probability 80%. We’ve already conducted exactly one flip of this coin, which came out heads (causing our update from the prior of 10/80/10 to 20/80/0), but no further flips yet.
For simplicity, event A will be “heads on next toss” (toss number 2), and B will be “heads on toss after next” (toss number 3).
P(A) = 0.2 * 1 + 0.8 * 0.5 = 0.6
P(B) = 0.2 * 1 + 0.8 * 0.5 = 0.6
P(A & B) = 0.2 * 1 * 1 + 0.8 * 0.5 * 0.5 = 0.4
Note that this is not the same as P(A) * P(B), which is 0.6 * 0.6 = 0.36.
The definition of independence is that A and B are independent iff P(A & B) = P(A) * P(B). These events are not independent.
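Both sides of this exchange can be put into one small calculation. The sketch below, with illustrative names only, computes the marginal probabilities under the 20/80 belief and also the same quantities with the coin type held fixed. It assumes, as everyone in the thread does, that flips of a known coin are independent.

```python
from fractions import Fraction as F

belief = {"heads-only": F(2, 10), "fair": F(8, 10)}
p_heads = {"heads-only": F(1), "fair": F(1, 2)}

# Marginal probabilities, averaging over which coin we hold.
p_a = sum(w * p_heads[c] for c, w in belief.items())              # heads on toss 2
p_b = p_a                                                         # heads on toss 3
p_a_and_b = sum(w * p_heads[c] ** 2 for c, w in belief.items())
p_nota_and_b = sum(w * (1 - p_heads[c]) * p_heads[c] for c, w in belief.items())

p_b_given_a = p_a_and_b / p_a
p_b_given_not_a = p_nota_and_b / (1 - p_a)

print("P(A) * P(B) =", p_a * p_b, " P(A & B) =", p_a_and_b)
print("P(B | A) =", p_b_given_a, " P(B | not A) =", p_b_given_not_a)

# With the coin type fixed, the joint probability factorises.
for c in belief:
    print(c, "P(A & B | coin) =", p_heads[c] ** 2, "= P(A | coin) * P(B | coin)")
```

So the tosses are independent given the coin, but not independent under the observer's mixed belief about which coin it is, which is the distinction the two commenters are arguing across.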
Turning the math crank without understanding what you are doing is worse than useless.
Our issue is about how to understand probability, not which numbers come out of the chute.
It’s awful that you were downvoted in this thread when you were mostly right and the others were mostly wrong. I’m updating my estimate of LW’s average intelligence downward.
No it doesn’t! A coin biased towards heads can have p(H) = 0.6, p(T) = 0.4, and each flip can be an independent trial. The total results from many flips will then be Poisson distributed.
I don’t think so. None of the available potential coin-states would generate an expected value of 600 heads.
p = 0.6 → 600 expected heads is the many-trials (where each trial is 1000 flips) expected value given the prior and the result of the first flip, but this is different from the expectation of this trial, which is bimodally distributed at [1000]x0.2 and [central limit around 500]x0.8
> However if you switch to continuous probabilities your estimate of the character of the coin will change.
No. If the distribution is symmetrical, then the probability density at .5 will be unchanged after a single coin toss.
> these two statements contradict each other.
No they don’t. He was saying that his estimate of the probability that the coin is unbiased (or approximately unbiased) does not change, but that the probability that the coin is weighted towards heads increased at the expense of the probability that the coin is weighted towards tails (or vice-versa, depending on the outcome of the first toss), which is correct.
> If the distribution is symmetrical, then the probability density at .5 will be unchanged after a single coin toss.
In the continuous-distribution world the probability density at exactly 0.5 is infinitesimally small. And the probability density at 0.5 plus-minus epsilon will change.
> No they don’t.
Yes, they do. We’re talking about expected values of coin tosses now, not about the probabilities of the coin being biased.
That’s not what a probability density is. You’re thinking of a probability mass.
Yes, you are right.
> the probability mass at 0.5 plus-minus epsilon will change.
(army1987 already addressed density vs mass.) No, for any x, the probability density at 0.5+x goes up by the same amount that the probability density at 0.5-x goes down (assuming a symmetrical prior), so for any x, the probability mass in [0.5-x, 0.5+x] will remain exactly the same.
> We’re talking about expected values of coin tosses now, not about the probabilities of the coin being biased.
Ok, instead of 1000 flips, think about the next 2 flips. The probability that exactly 1 of them lands heads does not change. This does not contradict the claim that the probability of the next flip being heads increases, because the probability of the next two flips both being heads increases while the probability of the next two flips both being tails decreases by the same amount (assuming you just saw the coin land heads).
You don’t even need to explicitly use Bayes’s theorem and do the math to see this (though you can). It all follows from symmetry and conservation of expected evidence. By symmetry, the change in probability of some event which is symmetric with respect to heads/tails must change by the same amount whether the result of the first flip is heads or tails, and by conservation of expected evidence, those changes must add to 0. Therefore those changes are 0.
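The two-flip version of the argument can be checked against the simple discrete example from earlier in the thread (prior 0.8 fair / 0.1 heads-only / 0.1 tails-only, posterior 0.8 / 0.2 / 0 after one heads). A minimal sketch, with `two_flip_probs` as a made-up helper name:

```python
from fractions import Fraction as F

p_heads = {"fair": F(1, 2), "heads-only": F(1), "tails-only": F(0)}
prior = {"fair": F(8, 10), "heads-only": F(1, 10), "tails-only": F(1, 10)}
# Belief after seeing one heads, taken from the example in the thread.
posterior = {"fair": F(8, 10), "heads-only": F(2, 10), "tails-only": F(0)}

def two_flip_probs(belief):
    """Probabilities of the outcomes of the next two flips under a belief."""
    hh = sum(w * p_heads[c] ** 2 for c, w in belief.items())
    tt = sum(w * (1 - p_heads[c]) ** 2 for c, w in belief.items())
    return {"HH": hh, "one of each": 1 - hh - tt, "TT": tt}

before = two_flip_probs(prior)
after = two_flip_probs(posterior)
for outcome in before:
    print(outcome, before[outcome], "->", after[outcome])
```

In this example HH goes up by 1/10, TT goes down by 1/10, and "one of each" stays at 2/5.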
> for any x, the probability density at 0.5+x goes up by the same amount that the probability density at 0.5-x goes down (assuming a symmetrical prior)
I don’t think that is true. Imagine that your probability density is a normal distribution. You update in such a way that the mean changes, so 0.5 is no longer the peak. This means that your probability density is no longer symmetrical around 0.5 (even if you started with a symmetrical prior) and the probability density line is not a 45 degree straight line, with the result that the density at 0.5+x changes by a different amount than at 0.5-x.
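> You update in such a way that the mean changes, 0.5 is no longer the peak. This means that your probability density is no longer symmetrical around 0.5 (even if you started with a symmetrical prior)
That is correct. Your probability distribution is no longer symmetrical after the first flip, which means that on the second flip, the symmetry argument I made above no longer holds, and you get information about whether the coin is biased or approximately fair. That doesn’t matter for the first flip though. Did you read the last paragraph in my previous comment? If so, was any part of it unclear?
> with the result that the density at 0.5+x changes by a different amount than at 0.5-x.
That does not follow from anything you wrote before it (the 45 degree straight line part is particularly irrelevant).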
Hm. Interesting how what looks like a trivially simple situation can become so confusing. Let me try to walk through my reasoning and see what’s going on...
We have a coin and we would like to know whether it’s fair. For convenience let’s define heads as 1 and tails as 0; one consequence of that is that we can think of the coin as a bitstring generator. What does it mean for a coin to be fair? It means that the expected value of the coin’s bitstring is 0.5. That’s the same thing as saying that the mean of the sample bitstring converges to 0.5.
Can we know for certain that the coin is fair on the basis of examining its bitstring? No, we cannot. Therefore we need to introduce the concept of acceptable certainty, that is, the threshold beyond which we think that the chance of the coin being fair is high enough (that’s the same concept as the p-value). In frequentist statistics we will just run an exact binomial test, but Bayes makes things a bit more complicated.
Luckily, Gelman in Bayesian Data Analysis looks exactly at this case (2nd ed., pp.33-34). Assuming a uniform prior on [0,1] the posterior distribution for theta (which in our case is the probability of the coin coming up heads or generating a 1) is
p(th | y) is proportional to (th ^ y) * ((1 - th) ^ (n - y))
where y is the number of heads and n is the number of trials.
After the first flip y=1, n=1 and so p( th | 1) is proportional to ( th )
Aha, this is interesting. Our prior was uniform so the density was just a straight horizontal line. After the first toss the line is still straight but is now sloping up with the minimum at zero and the maximum at 1.
So the expected value of the mean of our bitstring used to be 0.5 but is now greater than 0.5. And that is why I argued that the very first toss changes your expectations: your expected bitstring mean (= expected probability of the coin coming up heads) is now no longer 0.5 and so you don’t think that the coin is fair (because the fair coin’s expected mean is 0.5).
But that’s only one way of looking at it and now I see the error of my ways. After the first toss our probability density is still a straight line and it pivoted around the 0.5 point. This means that the probability mass in some neighborhood of [0.5-x, 0.5+x] did not change and so the probability of the coin being fair remains the same. The change in the expected value is because we think that if the coin is biased, it’s more likely to be biased towards heads than towards tails.
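For the uniform-prior case the pivot argument can be checked by hand. A short worked version, writing theta for th, taking one observed heads as in the comment above, and assuming 0 <= x <= 0.5:

```latex
\begin{align*}
p(\theta) &= 1, \qquad p(\theta \mid y=1, n=1) = 2\theta, \qquad 0 \le \theta \le 1 \\
E[\theta \mid y=1] &= \int_0^1 \theta \cdot 2\theta \, d\theta = \tfrac{2}{3} \\
P(0.5 - x \le \theta \le 0.5 + x \mid y=1) &= \int_{0.5-x}^{0.5+x} 2\theta \, d\theta = (0.5+x)^2 - (0.5-x)^2 = 2x \\
P(0.5 - x \le \theta \le 0.5 + x) &= \int_{0.5-x}^{0.5+x} 1 \, d\theta = 2x
\end{align*}
```

The expected value moves from 1/2 to 2/3, while the mass within x of 0.5 is 2x both before and after the flip.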
And yet this works because we started with a uniform prior, a straight density line. What if we start with a different, “curvier” prior? After the first toss the probability density should still pivot around the 0.5 point but because it’s not a straight line the probability mass in [0.5-x, 0.5+x] will not necessarily remain the same. Hmm… I don’t have time right now to play with it, but it requires some further thought.
Yes.
> What if we start with a different, “curvier” prior? After the first toss the probability density should still pivot around the 0.5 point but because it’s not a straight line the probability mass in [0.5-x, 0.5+x] will not necessarily remain the same.
Provided the prior is symmetrical, the probability mass in [0.5-x, 0.5+x] will remain the same after the first toss by the argument I sketched above, even though the probability density will not be a straight line. On subsequent tosses, of course, that will no longer be true. If you have flipped more heads than tails, then your probability distribution will be skewed, so flipping heads again will decrease the probability of the coin being fair, while flipping tails will increase the probability of the coin being fair. If you have flipped the same (nonzero) number of heads as tails so far, then your probability distribution will be different than it was when you started, but it will still be symmetrical, so the next flip does not change the probability of the coin being fair.
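For anyone who wants to play with a "curvier" prior, here is a rough numerical sketch. The Beta(3, 3)-shaped prior, the window x = 0.1, and the midpoint-rule integrator are all arbitrary choices for illustration, not anything taken from the thread.

```python
def prior_density(theta, a=3):
    """Unnormalised symmetric Beta(a, a)-shaped density on [0, 1]."""
    return (theta * (1 - theta)) ** (a - 1)

def integrate(f, lo, hi, steps=20_000):
    """Midpoint-rule integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

def mass_near_half(density, x):
    """Normalised probability mass in [0.5 - x, 0.5 + x]."""
    return integrate(density, 0.5 - x, 0.5 + x) / integrate(density, 0.0, 1.0)

def after_heads(theta):
    # Posterior after one heads is proportional to theta * prior.
    return theta * prior_density(theta)

def after_two_heads(theta):
    # Posterior after two heads is proportional to theta^2 * prior.
    return theta ** 2 * prior_density(theta)

x = 0.1
m0 = mass_near_half(prior_density, x)
m1 = mass_near_half(after_heads, x)
m2 = mass_near_half(after_two_heads, x)
print("prior:", m0, " after H:", m1, " after HH:", m2)
```

With these choices the mass near 0.5 is unchanged by the first heads and drops after the second, which matches the description above.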