This is a rather pedantic remark that doesn’t have much relevance to the primary content of the post (EDIT: it’s also based on a misunderstanding of what the post is actually doing—I missed that an explicit prior is specified which invalidates the concern raised here), but
If such a coin is flipped ten times by someone who doesn’t make literally false statements, who then reports that the 4th, 6th, and 9th flips came up Heads, then the update to our beliefs about the coin depends on what algorithm the not-lying[1] reporter used to decide to report those flips in particular. If they always report the 4th, 6th, and 9th flips independently of the flip outcomes—if there’s no evidential entanglement between the flip outcomes and the choice of which flips get reported—then reported flip-outcomes can be treated the same as flips you observed yourself: three Headses is 3 * 1 = 3 bits of evidence in favor of the hypothesis that the coin is Heads-biased. (So if we were initially 50:50 on the question of which way the coin is biased, our posterior odds after collecting 3 bits of evidence for a Heads-biased coin would be 2³:1 = 8:1, or a probability of 8/(1 + 8) ≈ 0.89 that the coin is Heads-biased.)
is not how Bayesian updating would work in this setting. As I’ve explained in my post about Laplace’s rule of succession, if you start with a uniform prior over [0,1] for the probability of the coin coming up heads and you observe a sequence of N heads in succession, you would update to a posterior of Beta(N+1,1) which has mean (N+1)/(N+2). For N=3 that would be 4/5 rather than 8/9.
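For concreteness, here’s a minimal Python sketch of that comparison, assuming the uniform Beta(1, 1) prior (the function names are just illustrative):

```python
def laplace_posterior_mean(n_heads: int) -> float:
    """Posterior mean of the heads-probability after n_heads heads in a row,
    starting from a uniform Beta(1, 1) prior: the posterior is Beta(n_heads + 1, 1)."""
    return (n_heads + 1) / (n_heads + 2)

def one_bit_per_heads(n_heads: int) -> float:
    """Probability you get if you instead treat each heads as a full bit of
    evidence, i.e. posterior odds of 2**n_heads : 1."""
    odds = 2 ** n_heads
    return odds / (1 + odds)

for n in (1, 2, 3, 10):
    print(n, laplace_posterior_mean(n), one_bit_per_heads(n))
# For n = 3 this gives 0.8 vs. 0.888..., i.e. 4/5 rather than 8/9.
```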
I haven’t formalized this, but one problem with the entropy approach here is that the distinct bits of information you get about the coin are actually not independent, so they are worth less than one bit each. They aren’t independent because if you know some of them came up heads, your prior that the other ones also came up heads will be higher, since you’ll infer that the coin is likely to have been biased in the direction of coming up heads.
To not leave this totally up in the air, if you think of the $n$th heads as having an information content of
$\log_2\left(\frac{n+1}{n}\right)$
bits, then the total information you get from n heads is something like
$\sum_{k=1}^{n} \log_2\left(\frac{k+1}{k}\right) = \log_2(n+1)$
bits instead of n bits. Neglecting this effect leads you to make much more extreme inferences than would be justified by Bayes’ rule.
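A quick numerical check of that telescoping sum, as a rough Python sketch:

```python
import math

# Under Laplace's rule, the k-th heads in a row has probability k / (k + 1),
# so it carries log2((k + 1) / k) bits; the sum telescopes to log2(n + 1).
def total_information(n: int) -> float:
    return sum(math.log2((k + 1) / k) for k in range(1, n + 1))

for n in (1, 3, 10, 100):
    print(n, total_information(n), math.log2(n + 1))
# The two columns agree: e.g. 100 heads in a row are worth about 6.66 bits,
# not 100 bits.
```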
Thanks for this analysis! However—

if you start with a uniform prior over [0,1] for the probability of the coin coming up heads
I’m not. The post specifies “a coin that is either biased to land Heads 2/3rds of the time, or Tails 2/3rds of the time”—that is (and maybe I should have been more explicit), I’m saying our prior belief about the coin’s bias is just the discrete distribution {"1/3 Heads, 2/3 Tails": 0.5, "2/3 Heads, 1/3 Tails": 0.5}.
I agree that a beta prior would be more “realistic” in the sense of applying to a wider range of scenarios (your uncertainty about a parameter is usually continuous, rather than “it’s either this, or it’s that, with equal probability”), but I wanted to make the math easy on myself and my readers.
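For concreteness, a minimal Python sketch of the update with that discrete prior (variable names are just for illustration):

```python
# Coin is either 2/3-Heads or 2/3-Tails, with prior probability 0.5 each;
# the reporter's choice of which flips to report is independent of the outcomes.
p_heads_if_heads_biased = 2 / 3
p_heads_if_tails_biased = 1 / 3
prior_odds = 0.5 / 0.5  # 1:1 on Heads-biased vs. Tails-biased
n_reported_heads = 3

# Each reported Heads multiplies the odds by (2/3) / (1/3) = 2.
likelihood_ratio = (p_heads_if_heads_biased / p_heads_if_tails_biased) ** n_reported_heads
posterior_odds = prior_odds * likelihood_ratio      # 2**3 = 8, i.e. 8:1
posterior_prob = posterior_odds / (1 + posterior_odds)

print(posterior_prob)  # 0.888... = 8/9 ≈ 0.89
```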
Ah, I see. I missed that part of the post for some reason.
In this setup the update you’re doing is fine, but I think measuring the evidence for the hypothesis in terms of “bits” can still mislead people here. You’ve tuned your example so that the likelihood ratio is equal to two and there are only two possible outcomes, while in general there’s no reason for the likelihood ratio to equal the number of possible outcomes.
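As a rough illustration with some made-up biases: the evidence per reported Heads (in bits) is the log of the likelihood ratio, and it only comes out to exactly one bit because the bias happens to be 2/3:

```python
import math

# Per-Heads evidence in bits for a coin that is either p-Heads or p-Tails.
for p in (2 / 3, 0.9, 0.55):
    likelihood_ratio = p / (1 - p)
    print(p, math.log2(likelihood_ratio))
# 2/3  -> 1.0 bit per Heads
# 0.9  -> ~3.17 bits per Heads
# 0.55 -> ~0.29 bits per Heads
```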