Okay, Cyan, I have parsed your posts. I don’t know any statistics whatsoever except what I’ve learned over the last ten hours, but pretty much everything you say seems to be correct, except maybe the last paragraph of this post, which still looks foggy to me. The Jean Perrin example in the other comments section was especially illuminating. Let me rephrase it here for the benefit of future readers:
Suppose you’re Jean Perrin trying to determine the value of the Avogadro number. This means you have a family of probability distributions depending on a single parameter, and some numbers that you know were sampled from the distribution with the true parameter value. Now estimate it.
If you’re a frequentist, you calculate a 90% confidence interval for the parameter. Briefly, this means you calculate a couple of numbers (“statistics”) from the data—like, y’know, average them and stuff—in such a way that, for any given value of the parameter, if you imagine recalculating those statistics from random values sampled under that parameter, they’d have a 90% chance of landing on opposite sides of it. If a billion statisticians do the same, about 90% of them will be right—not much more and not much less. This is, presumably, good calibration.
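Here’s a quick simulation sketch of that claim, just to convince myself (toy numbers, i.i.d. normal measurements with a known noise level; every constant here is made up for illustration, not anything Perrin actually had):

```python
# Toy check of the "billion statisticians" claim: each one draws a dataset,
# computes a textbook 90% z-interval for the mean, and we count how many
# intervals catch the true value.  All constants are invented for the demo.
import random
import statistics
from math import sqrt

TRUE_MEAN = 6.022        # stand-in for the unknown parameter
SIGMA = 0.5              # measurement noise, assumed known
N_SAMPLES = 20           # data points per statistician
N_STATISTICIANS = 100_000
Z_90 = 1.6449            # two-sided 90% normal quantile

covered = 0
for _ in range(N_STATISTICIANS):
    data = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N_SAMPLES)]
    mean = statistics.fmean(data)
    half_width = Z_90 * SIGMA / sqrt(N_SAMPLES)
    covered += (mean - half_width) <= TRUE_MEAN <= (mean + half_width)

print(covered / N_STATISTICIANS)  # comes out near 0.90, whatever TRUE_MEAN is
```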
On the other hand, if you’re a Bayesian, you pick an uninformative prior, then use your samples to morph it into a posterior and get a 90% credible interval. Different priors lead to different intervals, and God only knows what proportion of a billion people like you will actually catch the true Avogadro number with their intervals, even though all of you quoted the same 90% credence. This is, presumably, poor calibration.
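And here’s the same toy setup from the Bayesian side, to show what I mean about the prior mattering. Each “Bayesian” does a conjugate normal update (again, a model and priors invented purely for illustration) and reports a 90% central credible interval; the only thing that differs between the two runs is the prior:

```python
# Same toy model as above, but each agent updates a normal prior and reports
# a 90% central credible interval.  A vague prior behaves almost exactly like
# the confidence interval; a confidently wrong prior almost never catches the
# true value, even though both agents say "90%".
import random
import statistics
from math import sqrt

TRUE_MEAN = 6.022
SIGMA = 0.5
N_SAMPLES = 20
N_BAYESIANS = 100_000
Z_90 = 1.6449

def coverage(prior_mean, prior_sd):
    hits = 0
    for _ in range(N_BAYESIANS):
        data = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N_SAMPLES)]
        xbar = statistics.fmean(data)
        prior_prec = 1.0 / prior_sd ** 2      # conjugate normal-normal update
        data_prec = N_SAMPLES / SIGMA ** 2    # with SIGMA assumed known
        post_var = 1.0 / (prior_prec + data_prec)
        post_mean = post_var * (prior_prec * prior_mean + data_prec * xbar)
        half_width = Z_90 * sqrt(post_var)
        hits += (post_mean - half_width) <= TRUE_MEAN <= (post_mean + half_width)
    return hits / N_BAYESIANS

print(coverage(prior_mean=6.0, prior_sd=10.0))  # vague prior: about 0.90
print(coverage(prior_mean=3.0, prior_sd=0.1))   # confident wrong prior: nearly 0
```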
This sounds like an opportune moment to pull a Jaynes and demonstrate conclusively why one side is utterly dumb and the other is forever right, but I don’t yet feel the power. Let someone else do that, please? (Eliezer, are you listening?)
The classic answer is that your confidence intervals are liable to occasionally tell you that the mass is a negative number when a large error occurs. Is this interval, which allows only negative masses, 90% likely to be correct? No, even if you used an experimental method that a priori was 90% likely to yield an interval covering the correct answer. In other words, treating the confidence interval as the posterior probability and plugging it into the expected-utility decision function doesn’t make sense. Frequentists think that ignoring this problem means it goes away.
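To make the negative-mass point concrete, here’s a toy version (my own numbers, not Jaynes’s example): the quantity is a mass, so it can’t be negative, yet the textbook interval will sometimes sit entirely below zero.

```python
# Keep drawing noisy datasets of a small positive mass until the textbook
# 90% z-interval lands entirely below zero.  Over all repetitions the
# procedure still covers the true mass about 90% of the time, but this
# particular report is known to be wrong the moment you look at it.
import random
import statistics
from math import sqrt

TRUE_MASS = 0.05     # small but positive, arbitrary units
SIGMA = 1.0          # noise much larger than the signal
N_SAMPLES = 5
Z_90 = 1.6449
half_width = Z_90 * SIGMA / sqrt(N_SAMPLES)

trials = 0
while True:
    trials += 1
    data = [random.gauss(TRUE_MASS, SIGMA) for _ in range(N_SAMPLES)]
    mean = statistics.fmean(data)
    lo, hi = mean - half_width, mean + half_width
    if hi < 0:
        break

print(f"trial {trials}: 90% CI for the mass is [{lo:.3f}, {hi:.3f}]")
```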
They don’t.
I don’t mean the negative-answer problem. I mean the “the confidence interval simply is not the posterior probability, full stop” problem.
Well, sure. But whither calibration?
I already gave Cyan that classic answer, complete with a link to Jaynes, in this very comment thread. :-) But it doesn’t settle the problem completely for me. It feels like finger-pointing. Yes, frequentists have lower-quality answers; but why isn’t the average calibration of a billion Bayesians in any way related to that 90% number that they all use?
I pulled a little switcheroo in the Avogadro’s number example: calibration is a property of one agent considering multiple estimation problems, not multiple agents considering one estimation problem. But I think the argument still goes through, i.e., your summary above could be rewritten to take this into account just by changing a few words.
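In code, the one-agent-many-problems version would look something like this: a sketch under the strong, purely illustrative assumption that the true values of the many unrelated problems really are drawn from the agent’s own prior. None of these numbers come from the example above; they only exist to show what the definition is asking for.

```python
# Calibration as a property of one agent across many problems: for each
# problem a true value is drawn (here, from the very prior the agent uses --
# a strong assumption made only to illustrate the definition), the agent
# observes noisy data, updates, and reports a 90% credible interval.
# We then ask what fraction of those intervals caught their own true value.
import random
import statistics
from math import sqrt

SIGMA = 0.5
N_SAMPLES = 20
N_PROBLEMS = 100_000
Z_90 = 1.6449
PRIOR_MEAN, PRIOR_SD = 0.0, 2.0   # the agent's prior, reused for every problem

hits = 0
for _ in range(N_PROBLEMS):
    truth = random.gauss(PRIOR_MEAN, PRIOR_SD)   # this problem's true value
    data = [random.gauss(truth, SIGMA) for _ in range(N_SAMPLES)]
    xbar = statistics.fmean(data)
    prior_prec = 1.0 / PRIOR_SD ** 2
    data_prec = N_SAMPLES / SIGMA ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * PRIOR_MEAN + data_prec * xbar)
    half_width = Z_90 * sqrt(post_var)
    hits += (post_mean - half_width) <= truth <= (post_mean + half_width)

print(hits / N_PROBLEMS)  # about 0.90, under this assumption at least
```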
Hmm. I hadn’t noticed that; stupidity strikes again. But regardless of the semantics of the word “calibration”, the property outlined in my summary seems like a nice property to have, and I feel kinda left out for not possessing it.