From “Bayes’ Theorem”:

In front of you is a bookbag containing 1,000 poker chips. I started out with two such bookbags, one containing 700 red and 300 blue chips, the other containing 300 red and 700 blue. I flipped a fair coin to determine which bookbag to use, so your prior probability that the bookbag in front of you is the red bookbag is 50%. Now, you sample randomly, with replacement after each chip. In 12 samples, you get 8 reds and 4 blues. What is the probability that this is the predominantly red bag?
… a blue chip is exactly the same amount of evidence as a red chip, just in the other direction … If you draw one blue chip and one red chip, they cancel out. So the ratio of red chips to blue chips does not matter; only the excess of red chips over blue chips matters. There were eight red chips and four blue chips in twelve samples; therefore, four more red chips than blue chips. …
We can now see intuitively that the bookbag problem would have exactly the same answer, obtained in just the same way, if sixteen chips were sampled and we found ten red chips and six blue chips.
Did I misunderstand something, or does the last quoted sentence contradict the whole previous explanation? Please double-check me.
If I am correct and this is a mistake, it should be fixed both on Eliezer’s page and in the Sequences ebook.
No, it looks perfectly fine to me; “8 reds and 4 blues” is the same evidence as “10 reds and 6 blues”, or for that matter, as “104 reds and 100 blues” (in that context): what counts is the difference, not the ratio.
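To make that concrete, here is a minimal Python sketch (the helper name and structure are mine, not anything from the post): with just the two bag hypotheses at even prior odds, the binomial coefficients cancel and the posterior depends on the draws only through reds minus blues.

```python
def posterior_red_bag(reds, blues, p_red=0.7, prior=0.5):
    """Posterior probability of the (700 red, 300 blue) bag after sampling with replacement.
    The binomial coefficient is identical under both hypotheses, so it cancels and is omitted."""
    like_red  = p_red ** reds * (1 - p_red) ** blues    # P(data | red bag)
    like_blue = (1 - p_red) ** reds * p_red ** blues    # P(data | blue bag)
    return prior * like_red / (prior * like_red + (1 - prior) * like_blue)

for reds, blues in [(8, 4), (10, 6), (104, 100)]:
    print(f"{reds}:{blues} -> {posterior_red_bag(reds, blues):.6f}")
# All three print the same posterior, about 0.967, because the likelihood
# ratio is (7/3) ** (reds - blues): only the excess of reds matters.
```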
Surely that can’t be correct.

Intuitively, I would be pretty ready to bet that I know the correct bookbag if I pulled out 5 red chips and 1 blue. 97% seems a fine level of confidence.
But if we get 1,000,004 reds and 1,000,000 blues, I doubt I’d be so sure. It seems pretty obvious to me that you should be somewhere close to 50/50, because you’re clearly getting random data. To say that you could be 97% confident is insane.
I concede that you’re getting screwed over by the multiverse at that point, but there’s got to be some accounting for ratio. There is no way that you should be equally confident in your guess regardless of whether you receive ratios of 5:1, 10:6, 104:100, or 1000004:1000000.
What getting a ratio of 1000004:1000000 tells you is that you’re looking at the wrong hypotheses.
If you know absolutely-for-sure (because God told you, and God never lies) that you have either a (700,300) bag or a (300,700) bag and are sampling whichever bag it is uniformly and independently, and the only question is which of those two situations you’re in, then the evidence does indeed favour the (700,300) bag by the same amount as it would if your draws were (8,4) instead of (1000004,1000000).
But the probability of getting anything like those numbers in either case is incredibly tiny, and long before getting to (1000004,1000000) you should have lost your faith in what God told you. Your bag contains some other numbers of chips, or you’re drawing from it in some weirdly correlated way, or the devil is screwing with your actions or perceptions.
(“Somewhere close to 50:50” is correct in the following sense: if you start with any sensible probability distribution over the number of chips in the bags that does allow something much nearer to equality, then Pr((700,300)) and Pr((300,700)) are far closer to one another than either is to Pr(somewhere nearer to equality), and the latter is what you should be focusing on, because you clearly don’t really have either (700,300) or (300,700).)
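To put rough numbers on “incredibly tiny” (my own sketch, not gjm’s; it uses SciPy’s binomial log-pmf):

```python
import math
from scipy.stats import binom

n = 2_000_004   # total draws
k = 1_000_004   # reds observed

# Log10 probability of drawing exactly k reds under each chip-fraction hypothesis.
for p in (0.7, 0.3, 0.5):
    log10_pmf = binom.logpmf(k, n, p) / math.log(10)
    print(f"p = {p}: log10 P(exactly {k:,} reds) ~ {log10_pmf:,.0f}")

# Roughly -75,700 under p = 0.7 and under p = 0.3, versus roughly -3 under
# p = 0.5: the observation is astronomically improbable under either bag.
```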
Maybe I should back up a bit.

I agree that at 1000004:1000000, you’re looking at the wrong hypothesis. But in the above example, 104:100, you’re looking at the wrong hypothesis too. It’s just that a factor of 10,000x makes it easier to spot. In fact, at 34:30, or with even fewer iterations, you’re probably also working from the wrong hypothesis.
A single percentage point of doubt gets blown up and multiplied, but that percentage point has to come from somewhere. It can’t just spring forth from nothingness once you get past 50 iterations. That means you can’t be 96.6264% certain (Eliezer’s pre-rounding certainty) at the start; it has to be a little lower.
The real question in my mind is when that 1% of doubt actually becomes a significant 5% -> 10% -> 20% chance that something’s wrong. 8:4 feels fine. 104:100 feels overwhelming. But how much doubt am I supposed to feel at 10:6 or at 18:14?
How do you even calculate that if there’s no allowance in the original problem?
There should always, really, be “allowance in the original problem”. Perhaps not explicitly factored in, but you should assign some nonzero probability to possibilities like “the experimenter lied to me”, “I goofed in some crazy way”, “I am being deceived by malevolent demons”, etc. In practice, these wacky hypotheses may not occur to you until the evidence for them starts getting large, and you can decide at that point what prior probabilities you should have put on them. (Unfortunately it’s easy to do that wrongly, e.g. because of hindsight bias.)
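One standard way to build in that allowance (this construction is mine, and the uniform catch-all is an arbitrary but conventional choice, essentially the Bayes-Laplace marginal you get from a uniform prior on the red fraction): give “my model is wrong” a small prior and a deliberately vague likelihood, then watch its posterior as the draws come in.

```python
import math

def log_binom_pmf(k, n, p):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def doubt(reds, blues, eps=0.01):
    """Posterior of a catch-all 'something is wrong' hypothesis with prior eps.
    Catch-all likelihood: the red count is uniform on 0..n, i.e. P(k) = 1/(n+1)."""
    n = reds + blues
    logs = {
        "red bag":  math.log((1 - eps) / 2) + log_binom_pmf(reds, n, 0.7),
        "blue bag": math.log((1 - eps) / 2) + log_binom_pmf(reds, n, 0.3),
        "wrong":    math.log(eps) - math.log(n + 1),
    }
    m = max(logs.values())
    total = sum(math.exp(v - m) for v in logs.values())
    return math.exp(logs["wrong"] - m) / total

for reds, blues in [(8, 4), (10, 6), (18, 14), (104, 100)]:
    print(f"{reds}:{blues} -> P(something's wrong) ~ {doubt(reds, blues):.3f}")
# With a 1% prior on "wrong": about 0.006 at 8:4, 0.007 at 10:6, 0.016 at 18:14,
# and essentially 1.000 at 104:100. The exact numbers depend on the arbitrary
# catch-all, but the shape answers the question above.
```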
As Douglas_Knight says, frequentist statistics is full of tests that will tell you when some otherwise plausible hypothesis (e.g., “these two samples are drawn from things with the same probability distribution”) is incompatible with the data in particular (or not-so-particular) ways.

Frequentist tests are good here.
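For instance (my illustration, not anything Douglas_Knight posted), SciPy’s exact binomial test (binomtest, available since SciPy 1.7) rejects both bag hypotheses at 104:100 outright:

```python
from scipy.stats import binomtest

# 104 reds in 204 draws: test against each bag's red-chip fraction.
for p in (0.7, 0.3):
    result = binomtest(104, n=204, p=p, alternative="two-sided")
    print(f"H0: red fraction = {p} -> p-value = {result.pvalue:.2e}")

# Both p-values come out around 1e-8 or smaller, so a frequentist would
# discard both bag hypotheses here without needing a third one to compare.
```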
I concede that you’re getting screwed over by the multiverse at that point, but there’s got to be some accounting for ratio. There is no way that you should be equally confident in your guess regardless of whether you receive ratios of 5:1, 10:6, 104:100, or 1000004:1000000.
Yeah, that’s why I added “(in that context)”, i.e. we are 100% sure that those two hypotheses are the only ones. If there’s even a 0.01% chance that the distribution could be 50/50 (as is likely in the real world), then that hypothesis is going to become way more likely.
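Concretely (my own sketch; the 0.01% figure is the one from the comment above, and the computation is done in log space because the raw likelihoods underflow):

```python
import math

def log_likelihood(reds, blues, p_red):
    # Binomial coefficient omitted: it is identical across hypotheses and cancels.
    return reds * math.log(p_red) + blues * math.log(1 - p_red)

reds, blues = 1_000_004, 1_000_000
hypotheses = {"700/300 bag": (0.7, 0.49995),
              "300/700 bag": (0.3, 0.49995),
              "fair 50/50 bag": (0.5, 0.0001)}

logs = {name: math.log(prior) + log_likelihood(reds, blues, p)
        for name, (p, prior) in hypotheses.items()}
m = max(logs.values())
total = sum(math.exp(v - m) for v in logs.values())
for name, v in logs.items():
    print(f"{name}: posterior ~ {math.exp(v - m) / total:.6g}")
# The fair bag takes essentially all the posterior: each 700/300-type hypothesis
# is suppressed by a factor of roughly exp(-174000) relative to it, which no
# 0.01% prior handicap can come close to offsetting.
```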
There’s actually some really cool math developed about situations like this one. Large deviation theory describes how occurrences like the 1,000,004 red / 1,000,000 blue one become unlikely at an exponential rate, and how, conditioning on them occurring, information about the manner in which they occurred can be deduced. It’s a somewhat trivial conclusion in this case, but if we accept a principle of maximum entropy, we can be dead certain that any of the 2,000,004 red-or-blue draws looks marginally like a Bernoulli with 1,000,004:1,000,000 odds. That’s just the likeliest way (outside of our setup being mistaken) of observing our extremely unlikely outcome.
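For anyone who wants the standard formula (my addition; the rate function for coin flips is just the KL divergence): the chance that a Bernoulli(p) sample of size n shows an empirical red fraction near q decays like exp(-n * KL(q || p)).

```python
import math

def kl_bernoulli(q, p):
    """Large-deviation rate function for Bernoulli(p): KL(q || p) in nats."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

n = 2_000_004
q = 1_000_004 / n            # observed red fraction, a hair above 1/2

for p in (0.7, 0.3, 0.5):
    rate = kl_bernoulli(q, p)
    print(f"p = {p}: KL(q||p) = {rate:.6f} nats, P ~ exp(-{rate * n:,.0f})")

# Against p = 0.7 or p = 0.3 the exponent is about 174,000 nats; against
# p = 0.5 it is nearly zero. That is the exponential rate the comment mentions.
```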
Thanks a lot! Somehow I read something other than what was actually written there, and repeated readings didn’t help. When you wrote it using digits, I realized the confusion.
(Specifically, it was: “eight and four” vs “sixteen and ten”. Yeah, those words are there, but in a different context.)