Basic question about bits of evidence vs. bits of information:
I want to know the value of a random bit. I’m collecting evidence about the value of this bit.
First off, it seems weird to say “I have 33 bits of evidence that this bit is a 1.” What is a bit of evidence, if it takes an infinite number of bits of evidence to get 1 bit of information?
Second, each bit of evidence gives you a likelihood multiplier of 2. E.g., a piece of evidence that puts the odds at 4:1 that the bit is a 1 gives you 2 bits of evidence about the value of that bit. Independent evidence that puts the odds at 2:1 gives you 1 bit of evidence.
But that means a one-bit evidence-giver is someone who is right 2/3 of the time (a 2:1 update from even odds gives odds of 2:1, i.e., probability 2/3). Why 2/3?
Finally, if you knew nothing about the bit, and had the probability distribution Q = (P(1)=.5, P(0)=.5), and a one-bit evidence giver gave you 1 bit saying it was a 1, you now have the distribution P = (2/3, 1/3). The KL divergence D(P || Q) (log base 2) is only 0.0817, so it looks like you’ve gained 0.08 bits of information from your 1 bit of evidence. ???
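For concreteness, here is that arithmetic as a minimal Python sketch (variable names chosen to match the text above):

```python
from math import log2

Q = [0.5, 0.5]   # prior: know nothing about the bit
P = [2/3, 1/3]   # posterior after one "bit of evidence" for 1

# KL divergence D(P || Q) in bits
kl = sum(p * log2(p / q) for p, q in zip(P, Q))
print(kl)  # ~0.0817
```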
I think I was wrong to say that 1 bit of evidence = a likelihood multiplier of 2.
If you have a signal S, and P(x|S) = 1 while P(x|~S) = .5, then the likelihood multiplier is 2 and you get 1 bit of information, as computed by KL divergence. That signal did in fact require an infinite amount of evidence to make P(x|S) = 1, I think, so it’s a theoretical signal found only in math problems, like a frictionless surface in physics.
If you have a signal S, and P(x|S) = .5 while P(x|~S) = .25, then the likelihood multiplier is 2, but you get only .2075 bits of information (the KL divergence of the posterior (.5, .5) from the prior (.25, .75)).
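Both cases in a few lines of Python (a minimal sketch; kl_bits is just an illustrative helper):

```python
from math import log2

def kl_bits(p, q):
    """D(P || Q) in bits, for distributions (p, 1-p) and (q, 1-q)."""
    return sum(pi * log2(pi / qi)
               for pi, qi in ((p, q), (1 - p, 1 - q)) if pi > 0)

print(kl_bits(1.0, 0.5))   # signal takes you from .5 to 1:   1.0 bit
print(kl_bits(0.5, 0.25))  # signal takes you from .25 to .5: ~0.2075 bits
```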
There’s a discussion of a similar question on stats.stackexchange.com. It appears that the sum, over a series of observations x, of the log likelihood ratio

log( P(x | model 2) / P(x | model 1) )

approximates the information gain from changing from model 1 to model 2, but not on a term-by-term basis. The approximation relies on the observed frequencies over the whole series being close to model 2: the expected value of each term under model 2 is exactly the KL divergence of model 2 from model 1.
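A quick simulation of that, as a sketch assuming the observations really are drawn from model 2 (the model probabilities below are arbitrary choices):

```python
import random
from math import log2

random.seed(0)
p1, p2 = 0.5, 0.75   # model 1 and model 2: P(observation = 1)
n = 100_000

# Draw the whole observation series from model 2.
xs = [1 if random.random() < p2 else 0 for _ in range(n)]

def log_lr(x):
    # log likelihood ratio for one observation, model 2 over model 1
    return log2((p2 if x else 1 - p2) / (p1 if x else 1 - p1))

avg = sum(log_lr(x) for x in xs) / n
kl = p2 * log2(p2 / p1) + (1 - p2) * log2((1 - p2) / (1 - p1))
print(avg, kl)  # both ~0.189 bits per observation
```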
Yes, there are incompatible uses of the phrase “bits of evidence.” In fact, the likelihood version is not even compatible with itself: bits of evidence for Heads are not the same as bits of evidence against Tails (odds ratios do have that formal property). But it still has its place. You may be interested in this Wikipedia article. In that version, a bit of information advantage over the market is the ability to add log(2) to your expected log wealth, betting at the market prices. If you know the value of the next coin flip with certainty, then maybe you can leverage that into arbitrarily large returns, although I think the formalism breaks down at that point.
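A sketch of the betting claim (my own illustration, not from the linked article), using the standard result that proportional betting at fair market prices grows log wealth at the KL divergence between your beliefs and the market’s:

```python
from math import log2

def log_growth(p, market=0.5):
    # Expected log2 wealth growth per flip for a bettor who knows the
    # true P(heads) = p and bets proportionally at fair market prices:
    # this works out to the KL divergence D(p || market), in bits.
    return sum(pi * log2(pi / mi)
               for pi, mi in ((p, market), (1 - p, 1 - market)) if pi > 0)

print(log_growth(2/3))  # ~0.0817 bits: the "one bit of evidence" edge
print(log_growth(1.0))  # 1.0 bit: certainty doubles your wealth each flip
```

Note that the 0.0817 for the 2:1 edge is the same KL divergence computed in the top-level question: the betting edge and the information gain coincide.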
Why does the likelihood grow by exactly a factor of 2? (I’m just used to really indirect evidence, which is also seldom binary, in the sense that I only get to see whole suites of traits, which usually go together but in some obscure cases vary in composition. So I guess I have plenty of C-bits that do go into B-bits that might go into A-bits, but how do I measure the change in likelihood of A given C? I know it has to do with d-separation, but if C is something directly observable, like biomass, and B is an abstraction, like species, should I not derive A (an even higher abstraction, like ‘adaptiveness of spending early years in soil’) from C? There are just so many more metrics for C than for B...)
Sorry for the ramble, I just felt stupid enough to ask anyway. If you were distracted from answering the parent, please do.
I don’t understand what you’re asking, but I was wrong to say the likelihood grows by 2. See my reply to myself above.
First off, it seems weird to say “I have 33 bits of evidence that this bit is a 1.”
It seems weird to me because the bits of “33 bits” look like the same units as the bit of “this bit,” but they aren’t the same. Map/territory. From now on, I’m calling the first A-bits and the second B-bits.
Why does it take an infinite number of bits of evidence to get 1 bit of information?
It takes an infinite number of A-bits to know with absolute certainty one B-bit.
But that means a one-bit evidence-giver is someone who is right 2⁄3 of the time. Why the 2/3? That seems weird.
What were you expecting?
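To put numbers on the A-bit/B-bit point, a small sketch in the thread’s convention that each A-bit doubles the odds: n A-bits take even odds to 2^n : 1, which approaches certainty but never reaches it.

```python
for n in (1, 2, 10, 33):
    odds = 2 ** n          # n A-bits double the odds n times
    p = odds / (odds + 1)  # posterior P(B-bit = 1)
    print(f"{n:2d} A-bits -> P = {p:.12f}")
```

Even the “33 bits of evidence” from the original question leaves P(B-bit = 1) at about 0.99999999988, not 1.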