It’s not clear to me whether you’re challenging the validity of uncertainty intervals in general, or just the specific definition given by the NIH. If we’re only talking about the NIH quoted definition, I agree that they make it sound as though, in the 5% of cases where q falls outside the interval, q could be anything and all bets are off, which, if true, would make it dangerous to depend on the intervals.
But for uncertainty intervals derived from the basic Bayesian algorithms, e.g. finding the highest-density interval of a sandwich-tomato Beta distribution, I don’t think that q is uniformly likely to be anything from 0 to 1 in the 5% of cases where it falls outside the interval. If I have a 95% Beta interval of [0.9, 1], then in the cases where q falls outside the interval, it’s still more probable that q is 0.8 than that q is 0.2. So if this is what you mean, I don’t agree that basing practical inferences on them requires horrible caution.
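To illustrate with made-up numbers: under a Beta(1, 1) prior, observing (say) 190 “sandwich” and 8 “tomato” answers gives a Beta(191, 9) posterior whose 95% interval sits near [0.9, 1], and the posterior density at q = 0.8 dwarfs the density at q = 0.2. The counts here are hypothetical, chosen only to produce an interval like the one above:

```python
def beta_pdf_unnorm(x, a, b):
    """Unnormalized Beta(a, b) density; the normalizing constant
    cancels when comparing two points of the same posterior."""
    return x ** (a - 1) * (1 - x) ** (b - 1)

# Hypothetical counts: 190 prefer sandwiches, 8 prefer tomatoes,
# under a Beta(1, 1) prior -> posterior Beta(191, 9).
a, b = 191, 9
ratio = beta_pdf_unnorm(0.8, a, b) / beta_pdf_unnorm(0.2, a, b)
# ratio is astronomically large: even outside the interval,
# q = 0.8 is vastly more probable than q = 0.2
```

So "outside the interval" is not "uniform over [0, 1]": the posterior still concentrates near the interval's edge.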
It’s also unclear to me when C has the property that intervals near 0 and 1 only occur when q is between 0.4 and 0.6. If an algorithm to produce intervals has this property, it sounds like a bad algorithm with one or more mistakes in it that no one should use. Using the word “confidence interval” to include both this bad algorithm, and better algorithms that avoid the bad property, then concluding that intervals in general are dangerous, is technically a true conclusion, but… not interesting, if that makes sense? Sort of like if someone said “a mathematician can make mistakes in calculations, thereby causing their results to have counterintuitive properties; therefore, be cautious when relying on math.” Totally true! But hearing this, I wouldn’t feel I had learned anything about math.
Do you have an example of the kind of interval calculation that would be more likely to produce intervals near 0 or 1 when q is between 0.4 and 0.6?
Thanks for your scrutiny :) (and sorry for the long-winded response...) Let me try to clarify the bottom line of the post:
This post clarifies some subtle points about the ways in which confidence intervals are useful. In the way that a confidence interval is defined mathematically (as far as I understand), without any further axioms, it does not provide many guarantees. As a side note, the NIH claim seems to be just wrong (and is not what I take to be the standard definition the rest of the article is about), and there isn’t any method of attaching confidence intervals that can live up to their claim.
It’s not that we shouldn’t use confidence intervals in any form. But when practical consequences are drawn conditional on a confidence interval, one has to be aware that there will be some error. In many situations, confidence intervals might be sufficiently “nice” that these errors are negligible and the conclusions still point in the right direction, but there will be some error, at least in how strong the evidence is taken to be. (The exception is if you go beyond the definition of a confidence interval and use the narrowness of the interval as an intuitive indicator of the strength of evidence, if that’s possible with your given method of attaching confidence intervals; but then you aren’t really using the fact that it’s a confidence interval.)
Here’s an example of a maliciously constructed confidence interval for the scenario in the post. If more than, say, 90, or fewer than 10, people from the sample prefer sandwiches, output [0, 1] as the confidence interval. If exactly 50 people prefer sandwiches, output [0.9, 1]. Otherwise, output the interval centered at the sample mean, with its width adjusted to account for the standard deviation. Note that it’s rare for exactly 50 people to prefer sandwiches (a bound independent of q is 8%), so this trick doesn’t worsen the confidence level of the interval too much. If one plans to act only upon clear-cut intervals such as [0.9, 1], one will almost always lose when these intervals occur (a 50:50 split will be obtained most of the time when q is near 0.5).
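A sketch of this malicious recipe in code (the sample size n = 100 and the normal-approximation width for the "honest" branch are my assumptions for concreteness):

```python
from math import comb, sqrt

def malicious_interval(k, n=100, z=1.96):
    """'Confidence interval' for q given that k of n sampled people
    prefer sandwiches, following the malicious recipe above (a sketch)."""
    if k > 90 or k < 10:
        return (0.0, 1.0)              # vacuous catch-all interval
    if k == 50:
        return (0.9, 1.0)              # the malicious clause
    p = k / n                          # sample mean
    half = z * sqrt(p * (1 - p) / n)   # normal-approximation half-width
    return (max(0.0, p - half), min(1.0, p + half))

# Exactly 50 of 100 is rare even in the worst case q = 0.5:
p_exactly_50 = comb(100, 50) * 0.5 ** 100   # about 0.08
```

The honest branch behaves sensibly (e.g. k = 60 yields roughly [0.50, 0.70]), so a consumer who only ever sees the output intervals has no way to notice the trap at k = 50.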
Will something similarly bad but less drastic happen in reality when the confidence interval method is not constructed in a malicious way? When it’s only about rough estimates, probably not, but I don’t know yet.
I should probably give the article a question as its title. The current title seems a bit too harsh and overshadows my conclusion that confidence intervals seem to be handy while I don’t understand when they are safe to use in practice. In view of the frequent use of confidence intervals in science (and their relevance for calibrated predictions), I’d like to understand how much I can infer from them in which situations. Do you know any good heuristics for this?
Gotcha—thanks for clarifying and providing the example—it helps!
Everything I know is from the Bayesian way of doing things, so I’m going to talk about uncertainty intervals, which I think are mostly the same as confidence intervals; the main difference, as far as I can tell, is philosophy. (People also call uncertainty intervals “credible intervals” or “credibility intervals”.)
With regard to evaluating the dependability of a given interval, I think it’s important to think about the underlying distribution the interval is being drawn over. I’ve drawn 3 examples in this image: I think you’re worried about situations like the third case (#C). In #C, when q doesn’t fall in the interval, it is probably far from the interval, because the rest of the probability is concentrated at the left & right bounds of the range.
I’m gonna come out strong and say that this can never happen in the tomato-sandwich case, when you use the correct calculations to build the interval. The correct calculations are:
Specify a Beta distribution, Beta(1, 1), as your prior. (The 1s can be other numbers; this doesn’t change my broader argument.)
Because the tomato-sandwich question is isomorphic to a coin flip, the data distribution is most naturally modeled as a Bernoulli. So treat your data as being drawn from a Bernoulli distribution.
Then the posterior distribution is Beta(1 + # tomato, 1 + # sandwich). [Since the Beta and Bernoulli are conjugate, this is always the form of the posterior].
Use either the equal-tails or highest-probability-density method to construct the interval.
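The four steps above can be sketched end-to-end. This is a minimal sketch with made-up counts (60 tomato, 40 sandwich); the equal-tails quantiles are read off a grid rather than computed with an incomplete-beta inverse:

```python
def equal_tail_interval(n_tomato, n_sandwich, level=0.95, grid=100_000):
    """Equal-tailed credible interval for q under a Beta(1, 1) prior.
    The Beta(1 + n_tomato, 1 + n_sandwich) posterior is evaluated on a
    grid and the tail quantiles read off the cumulative sums (a sketch,
    fine for moderate counts)."""
    a, b = 1 + n_tomato, 1 + n_sandwich
    xs = [(i + 0.5) / grid for i in range(grid)]
    w = [x ** (a - 1) * (1 - x) ** (b - 1) for x in xs]  # unnormalized pdf
    total = sum(w)
    tail = (1 - level) / 2
    cum, lo, hi = 0.0, 0.0, 1.0
    for x, wi in zip(xs, w):
        prev = cum
        cum += wi / total
        if prev < tail <= cum:
            lo = x
        if prev < 1 - tail <= cum:
            hi = x
    return lo, hi

# e.g. 60 tomato, 40 sandwich -> interval roughly (0.50, 0.69)
```

Because the Beta(61, 41) posterior has a single hump, the interval hugs the sample proportion, and nothing like #C can come out of it.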
Since the posterior distribution is a Beta, and a Beta with a few data points always has exactly one hump, #C won’t happen.[1] So if you know a calculation was done correctly, and that it is modeling a Bernoulli[2] situation, you’re safe—the risks of #C won’t be there. (You can play with different Beta distributions easily here to see that nothing like #C ever happens.)
Things are very often modeled as Gaussian (even things that are technically better-modeled as Beta), and for the Gaussian, it’s the same: one hump, never looks like #C. The intervals here are also well-behaved.
If you’re constructing intervals over the data distribution, then things get weird, yeah. But I don’t think it makes sense to construct intervals over the data distribution; or at the very least, if you do, you are leaving behind some of the safety guarantees of Bayesian calculations like the above. It is hard to imagine what doing so would mean in the tomato-sandwich case: the data is a bunch of “Tomato” and a bunch of “Sandwich”. There are four possible ‘intervals’ here (really they are sets): the one that contains only Tomato, the one that contains only Sandwich, the one that has both, and the one with neither. Other data distributions look more like probability distributions, but even there, going strictly off the data distribution, with no prior or posterior distributions anywhere… yeah, things could definitely get weird.
So maybe one heuristic is: beware of intervals constructed directly on the data distribution. I’ve done this sometimes (actually, often) when I’m lazy and things seem like they’ll be fine, so this is definitely a thing people do! If someone says “we modeled this as a [Gaussian/Beta/Gamma/etc.]”, then they probably have well-behaved calculations going on.
If the data distribution is bimodal, making a two-peaked distribution the appropriate posterior, and you use a Gaussian to model it, your conclusions will be way wrong, and your interval will have the kind of problems you’re worried about. But there’s no way to modify the interval-creation algorithm to identify the two modes from a Gaussian posterior; the problem was in choosing to model with a Gaussian in the first place. So I wouldn’t blame the interval algorithm here.
On the other hand, if you do know your posterior is bimodal, model it appropriately, and obtain a two-peaked posterior… hm. I think both the equal-tailed and highest-probability-density intervals would be super-wide, since they would have to stretch over both peaks to get all the density. So this is OK too—your interval isn’t useful, but it would be super-wide, so you’d notice. The real problem is #C, and for posteriors that look like #C, I think you’re totally right—the interval can mislead someone badly, if all they know is the interval and assume it came from something that looks like #A or #B.
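A quick check of the “super-wide” claim, using a made-up two-peaked posterior (an equal mixture of normals at −3 and +3 with unit variance; all numbers here are my assumptions) and the same grid-quantile trick:

```python
from math import exp, pi, sqrt

def mix_pdf(x):
    """Equal mixture of N(-3, 1) and N(+3, 1): two well-separated peaks."""
    g = lambda m: exp(-0.5 * (x - m) ** 2) / sqrt(2 * pi)
    return 0.5 * g(-3) + 0.5 * g(3)

# Equal-tailed 95% interval, read off grid quantiles over [-8, 8].
xs = [-8 + 16 * (i + 0.5) / 100_000 for i in range(100_000)]
w = [mix_pdf(x) for x in xs]
total = sum(w)
cum, lo, hi = 0.0, None, None
for x, wi in zip(xs, w):
    prev = cum
    cum += wi / total
    if prev < 0.025 <= cum:
        lo = x
    if prev < 0.975 <= cum:
        hi = x
# The interval stretches over both peaks (roughly [-4.6, 4.6]),
# so its sheer width gives the game away.
```

The interval is honest but uninformative, which is the benign failure mode: a reader sees the width and knows not to lean on it.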
Also, AFAIK, the Bayesian calculations for… anything?… always result in a full posterior probability distribution. So you can always look at the distribution and check if it has some bad #C-like property! Once satisfied it doesn’t, bang, make the interval. But like you say, this doesn’t really help when reading intervals published by other people...
[1] Beta distributions generally look like the distribution in B, and can look like A (Gaussian) with a lot of data, when q is not too close to 0 or 1.
[2] Very common—all “did they get better or not, yes or no” medical trials are like this, for example.