Since I don’t want this to spiral into another stopping rule argument, allow me to try and dissolve a confusing point that the discussions get stuck on.
What makes the Bayesian “lose” in the cases proposed by Mayo and Simonsohn isn’t the inference, it’s the scoring rule. A Bayesian scores himself on total calibration; “number of times my 95% confidence interval includes the truth” is just a small part of it. You can generate an experiment that has a high chance (let’s say 99%) of making a Bayesian have a 20:1 likelihood ratio in favor of some hypothesis. By conservation of expected evidence, the same experiment might have a 1% chance of generating close to a 2000:1 likelihood ratio against that same hypothesis. A frequentist could never be as sure of anything; this occasional 2000:1 confidence is the Bayesian’s reward. If you rig the rules to view something about 95% confidence intervals as the only measure of success, then the frequentist’s decision rule about accepting hypotheses at a 5% p-value wins; it’s not his inference that magically becomes superior.
Allow me to steal an analogy from my friend Simon: I’m running a Bayesian Casino in Vegas. Deborah Mayo comes to my casino every day with $31. She bets $1 on a coin flip, then bets $2 if she loses, then $4 and so on, until she either wins $1 or loses all $31 if 5 flips go against her. I obviously think that by conservation of expected money in a coin flip this deal is fair, but Prof. Mayo tells me that I’m a sucker because I lose more days than I win. I tell her that I care about dollars, not days, but she replies that if she had more money in her pocket, she could make sure I have a losing day with arbitrarily high probability! I smile and ask her if she wants a drink.
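To make the bookkeeping concrete, here is a minimal simulation sketch of that arrangement (the function and variable names are just illustrative):

```python
import random

def mayo_day(max_bets=5):
    """One day of the doubling strategy against a fair coin.
    Returns Mayo's net profit: +$1 if any flip wins, -$31 if all five lose."""
    lost_so_far = 0
    stake = 1
    for _ in range(max_bets):
        if random.random() < 0.5:       # she wins this flip
            return stake - lost_so_far  # always nets out to exactly +1
        lost_so_far += stake            # that stake is gone
        stake *= 2                      # and she doubles the next bet
    return -lost_so_far                 # 1 + 2 + 4 + 8 + 16 = 31 dollars lost

random.seed(0)
days = [mayo_day() for _ in range(100_000)]
print("fraction of days Mayo wins:", sum(d > 0 for d in days) / len(days))  # ~31/32
print("her average profit per day:", sum(days) / len(days))                 # ~0: a fair game
```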
You can generate an experiment that has a high chance (let’s say 99%) of making a Bayesian have a 20:1 likelihood ratio in favor of some hypothesis.
This is wrong, unless I’ve misunderstood you. Imagine the prior for hypothesis H is p, so the prior for ~H is 1-p. If you have a 99% chance of generating a 20:1 likelihood ratio for H, then your prior must be bounded below by 0.99*(20p/(19p+1)). (The second factor, 20p/(19p+1), is the posterior for H given a 20:1 likelihood ratio in its favor.) So we have the inequality p > 0.99*(20p/(19p+1)), which I was lazy about and used http://www.wolframalpha.com/input/?i=p%3E+.99*%2820p%29%2F%2819p%2B1%29%2C+0%3Cp%3C1 to solve; it tells me that p must be at least 0.989474.
So you can only expect to generate strong evidence for a hypothesis if you’re already pretty sure of it, which is just as it should be.
I may have bungled these calculations, doing them quickly, though.
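A compact way to get the same bound by hand, using only the fact that the prior equals the expected posterior (and that the 1% branch contributes at least zero):

```latex
p \;\ge\; 0.99 \cdot \frac{20p}{19p + 1}
\;\Longrightarrow\; 19p + 1 \;\ge\; 19.8
\;\Longrightarrow\; p \;\ge\; \frac{18.8}{19} \approx 0.98947
```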
Edit: removed for misunderstanding ike’s question and giving an irrelevant answer. Huge thanks to ike for teaching me math.
That’s exactly what I used it for in my calculation; I didn’t misunderstand that. Your computation of “conservation of expected evidence” simply does not work unless your prior is extremely high to begin with. Put simply, you cannot be 99% sure that you’ll later change your current belief p in H to anything greater than 100p/99, which places a severe lower bound on p for a likelihood ratio of 20:1.
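In symbols: if q is the posterior you are 99% sure of reaching, then conservation of expected evidence gives

```latex
p \;=\; \mathbb{E}[\text{posterior}] \;\ge\; 0.99\,q
\;\Longrightarrow\; q \;\le\; \frac{p}{0.99} \;=\; \frac{100p}{99}
```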
Yes! It worked! I learned something by getting embarrassed online!!!
ike, you’re absolutely correct. I applied conservation of expected evidence to likelihood ratios instead of to posterior probabilities, and thus didn’t realize that the prior puts bounds on expected likelihood ratios. This also means that the numbers I suggested (1% of 1:2000, 99% of 20:1) define the prior precisely at 98.997%.
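A quick numeric sketch checking that figure (the posterior helper is mine; the 1:2000 outcome is treated as a likelihood ratio of 1/2000 in favor of H):

```python
def posterior(p, lr):
    """Posterior probability of H from prior p and likelihood ratio lr in favor of H."""
    return lr * p / (lr * p + (1 - p))

p = 0.98997  # the prior claimed above

# Conservation of expected evidence: the prior should equal the expected posterior.
expected_posterior = 0.99 * posterior(p, 20) + 0.01 * posterior(p, 1 / 2000)
print(p, expected_posterior)  # the two agree to about five decimal places
```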
I’m going to leave the fight to defend the reputation of Bayesian inference to you and go do some math exercises.
A Bayesian scores himself on total calibration; “number of times my 95% confidence interval includes the truth” is just a small part of it. You can generate an experiment that has a high chance (let’s say 99%) of making a Bayesian have a 20:1 likelihood ratio in favor of some hypothesis. By conservation of expected evidence, the same experiment might have a 1% chance of generating close to a 2000:1 likelihood ratio against that same hypothesis. A frequentist could never be as sure of anything; this occasional 2000:1 confidence is the Bayesian’s reward.
Hold on. Let’s say I hire a Bayesian statistician to produce some estimate for me. I do not care about “scoring” or “reward”; all I care about is my estimate and how accurate it is. Now you are going to tell me that in 99% of the cases your estimate will be wrong and that’s fine because there is a slight chance that you’ll be really really sure of the opposite conclusion?
I’m running a Bayesian Casino in Vegas. Deborah Mayo comes to my casino every day with $31.
Why, that’s such a frequentist approach X-/
Let’s change the situation slightly. You are running the Bayesian Casino and Deborah Mayo comes to your casino once with, say, $1023 in her pocket. Will I lend you money to bet against her? No, I will not. The distribution matters beyond simple expected values.
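For reference, the arithmetic of that one-off visit with $1023, assuming she runs the same doubling strategy (ten bets of $1 through $512):

```latex
\mathbb{E}[\text{casino profit}]
= \tfrac{1023}{1024}\cdot(-\$1) + \tfrac{1}{1024}\cdot(+\$1023) = 0,
\qquad
P(\text{casino has a losing day}) = \tfrac{1023}{1024}.
```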
Reminds me of this bit from a Wasserman paper http://ba.stat.cmu.edu/journal/2006/vol01/issue03/wasserman.pdf
van Nostrand: Of course. I remember each problem quite clearly. And I recall that on each occasion I was quite thorough. I interrogated you in detail, determined your model and prior and produced a coherent 95 percent interval for the quantity of interest.
Pennypacker: Yes indeed. We did this many times and I paid you quite handsomely.
van Nostrand: Well earned money I’d say. And it helped win you that Nobel.
Pennypacker: Well they retracted the Nobel and they took away my retirement savings.
…
van Nostrand: Whatever are you talking about?
Pennypacker: You see, physics has really advanced. All those quantities I estimated have now been measured to great precision. Of those thousands of 95 percent intervals, only 3 percent contained the true values! They concluded I was a fraud.
van Nostrand: Pennypacker you fool. I never said those intervals would contain the truth 95 percent of the time. I guaranteed coherence not coverage!
Now you are going to tell me that in 99% of the cases your estimate will be wrong
No. Your calibration is still perfect if your priors are perfect. You can only get to that “99% chance of getting strong evidence for a hypothesis” if you’re already very sure of that hypothesis (see the math above).
What makes the Bayesian “lose” in the cases proposed by Mayo and Simonsohn isn’t the inference, it’s the scoring rule. A Bayesian scores himself on total calibration; “number of times my 95% confidence interval includes the truth” is just a small part of it. You can generate an experiment that has a high chance (let’s say 99%) of making a Bayesian have a 20:1 likelihood ratio in favor of some hypothesis. By conservation of expected evidence, the same experiment might have a 1% chance of generating close to a 2000:1 likelihood ratio against that same hypothesis. A frequentist could never be as sure of anything; this occasional 2000:1 confidence is the Bayesian’s reward. If you rig the rules to view something about 95% confidence intervals as the only measure of success, then the frequentist’s decision rule about accepting hypotheses at a 5% p-value wins; it’s not his inference that magically becomes superior.
Sometimes we might care about “total calibration” I guess, but sometimes we care about being actually calibrated in the rationalist sense. Sometimes we want a 95% confidence interval to mean that if we do this 100 times, it will include the true value about 95 times.
My point was that this idea that the stopping rule doesn’t matter is more complicated than calculating a Bayes factor and saying “look, the stopping rule doesn’t change the Bayes factor.”
My point was that this idea that the stopping rule doesn’t matter is more complicated than calculating a Bayes factor and saying “look, the stopping rule doesn’t change the Bayes factor.”
The stopping rule won’t change the expectation of the Bayes factor.
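One way to cash this out (my reading: take the expectation under the null, for a stopping rule that halts with probability 1):

```latex
\mathbb{E}_{H_0}\!\left[\frac{P(x \mid H_1)}{P(x \mid H_0)}\right]
= \sum_{x} P(x \mid H_0)\,\frac{P(x \mid H_1)}{P(x \mid H_0)}
= \sum_{x} P(x \mid H_1) = 1,
```

where x runs over the complete data sequences the stopping rule can produce (and every sequence possible under H1 is assumed possible under H0).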
Sometimes we want a 95% confidence interval to mean that if we do this 100 times, it will include the true value about 95 times.
If your prior is correct, then your 95% credibility interval will, in fact, be well calibrated and be correct 95% of the time. I argued at length on tumblr that most or all of the force of the stopping rule objection to Bayes is a disguised “you have a bad prior” situation. If you’re willing to ask the question that way, you can generate similar cases without stopping rules as well. For instance, imagine there are two kinds of coins: ones that land on heads 100% of the time, and ones that land on heads 20% of the time. (The rest are tails.) You have one flip with the coin. Oh, one more thing: I tell you that there are 1 billion coins of the first kind, and only one of the second kind.
You flip the coin once. It’s easy to show that there’s an overwhelming probability of getting a 5:1 likelihood ratio in favor of the first coin. Why is this problematic?
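Sketching the arithmetic of this example (the variable names are mine; the numbers come straight from the setup above):

```python
# Prior: a billion always-heads coins, one 20%-heads coin.
p_coin1 = 1e9 / (1e9 + 1)  # prior probability the drawn coin is the always-heads kind
p_coin2 = 1 / (1e9 + 1)    # prior probability it is the 20%-heads kind

p_heads = p_coin1 * 1.0 + p_coin2 * 0.2  # ~0.9999999992: heads is nearly certain
lr_for_coin1_given_heads = 1.0 / 0.2     # one observed heads multiplies the odds for coin 1 by 5

print(p_heads, lr_for_coin1_given_heads)
```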
I can and have given a similar case for 95% credibility intervals as opposed to Bayes factors, which I’ll write out if you’re interested.