Thanks for your scrutiny :) (and sorry for the long-winded response...)
Let me try to clarify the bottom line of the post:
This post clarifies some subtle points about the ways in which confidence intervals are useful. As a confidence interval is defined mathematically (as far as I understand), without any further axioms, it does not give many guarantees. As a side note, the NIH claim seems to be simply wrong (and is not what I take to be the standard definition, which the rest of the article is about), and no method of attaching confidence intervals can live up to that claim.
I’m not saying that we shouldn’t use confidence intervals at all. But when practical consequences are drawn conditional on a confidence interval, one has to be wary that there will be some error. In many situations, confidence intervals might be sufficiently “nice” that these errors are negligible and the conclusions still point in the right direction, but there will be some error, at least in how strong the evidence is taken to be. The exception is when you don’t rely on the definition of a confidence interval alone but use the narrowness of the interval as an intuitive indicator of the strength of evidence, if your method of attaching confidence intervals allows that. But then you aren’t really using the fact that it’s a confidence interval.
Here’s an example of a maliciously constructed confidence interval for the scenario in the post. If more than, say, 90 or less than 10 people from the sample prefer sandwiches, output as confidence interval. If exactly 50 people prefer sandwiches, output . Otherwise, output the interval centered at the mean of the sample and adjust the narrowness to account for the standard deviation. Note that it’s rare to have exactly 50 people prefer sandwiches (a bound independent of q is 8%), so this trick doesn’t worsen the confidence level of the interval too much. If one plans to only act upon clear-cut intervals such as , one will almost always lose when these intervals occur (50:50 will be obtained most of the time when q is near 0.5).
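The 8% figure can be checked with a few lines of arithmetic. This is only a sketch under an assumption: the sample size is taken to be 100 people, which is what the thresholds 10, 50, and 90 suggest but which is not restated here. The names `prob_exactly_k`, `N`, and `K` are made up for the illustration. The number who prefer sandwiches is then binomially distributed, and the code scans a grid of values for q to find where "exactly 50" is most likely.

```python
from math import comb


def prob_exactly_k(n: int, k: int, q: float) -> float:
    """Binomial probability that exactly k of n sampled people prefer sandwiches."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)


N = 100  # assumed sample size (not stated explicitly in the comment)
K = 50   # the "exactly 50 people" case from the example

# Scan q over a grid from 0 to 1 and find the value that makes
# "exactly 50 out of 100" as likely as possible.
grid = [i / 1000 for i in range(1001)]
worst_q = max(grid, key=lambda q: prob_exactly_k(N, K, q))
worst_p = prob_exactly_k(N, K, worst_q)

print(worst_q, round(worst_p, 4))
```

Under this assumption the maximum sits at q = 0.5 and comes out just below 0.08, so the special case fires at most about 8% of the time whatever q is.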
Will something similarly bad but less drastic happen in reality, when the confidence interval method is not constructed in a malicious way? When it’s only about rough estimates, probably not, but I don’t know yet.
I should probably give the article a question as its title. The current title seems a bit too harsh and overshadows my conclusion: confidence intervals seem to be handy, but I don’t yet understand when they are safe to use in practice. In view of the frequent use of confidence intervals in science (and their relevance for calibrated predictions), I’d like to understand how much I can infer from them, and in which situations. Do you know any good heuristics for this?
I think Zvi calls this a hostile epistemic environment, since there are actors who try really hard to produce convincing propaganda. Maybe a helpful heuristic is this: are there checks and balances for the media? As far as I know, this is hardly the case in Russia right now, since independent media outlets have been shut down and you can be jailed for expressing your sincere opinion. This is a very bad sign. (If there were some kind of freedom of speech, more people would be scrutinizing important claims, so that not hearing from these critics would be evidence for the truthfulness of these claims, I guess.) Unfortunately, the EU has also started blocking Russian state media outlets, which complicates the situation, but still, you don’t have to worry about being jailed for expressing a contrarian opinion.
Besides these quick thoughts, I want to propose a framing of the problem. Assume there’s a coin in the world and everybody has high stakes in whether it is fair or biased. Now, different news outlets report what they found out when they flipped the coin themselves. So some report that they got “1000x tails” and others state that their experiments suggest the coin is fair. Maybe they are, technically, both correct in their statements but ignored some coin flips that did not fit into their narrative. [Disclaimer: This doesn’t capture everything of real-world news but gives a feeling for the more complex topics where you build your opinion from lots of tiny pieces of evidence.]
The bottom line is that in a no-trust environment (which exists when people with disjoint trusted sources try to communicate), it’s not possible to settle whether the coin is fair.
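The coin framing can be made concrete with a toy simulation. Everything specific here is an assumption made for illustration: the coin is fair, it is flipped 2500 times, and the two hypothetical outlets (`outlet_a_report`, `outlet_b_report`) differ only in which flips they choose to report.

```python
import random


def flip_fair_coin(n_flips, seed):
    """Flip a fair coin n_flips times; 'H' is heads, 'T' is tails."""
    rng = random.Random(seed)
    return [rng.choice("HT") for _ in range(n_flips)]


def tails_fraction(reported):
    """Share of tails among the flips an outlet chose to report."""
    return reported.count("T") / len(reported)


all_flips = flip_fair_coin(2500, seed=0)

# Hypothetical outlet A reports only the tails it saw, up to 1000 of them.
outlet_a_report = [f for f in all_flips if f == "T"][:1000]
# Hypothetical outlet B reports every flip.
outlet_b_report = all_flips

print(len(outlet_a_report), tails_fraction(outlet_a_report))
print(len(outlet_b_report), round(tails_fraction(outlet_b_report), 3))
```

Both reports consist only of flips that really happened, yet outlet A's "1000x tails" and outlet B's roughly even split suggest opposite conclusions. A reader who sees only one report and cannot check how the flips were selected has no way to tell which one reflects the coin.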
A solution that I find especially exciting, at least in theory, is adversarial collaboration. You team up with a person of the opposite opinion and devise some kind of experiment (or active observation) that helps settle the disagreement. In the above framing, you flip the coin several times in the presence of the other person and follow a previously agreed protocol for determining which side is supported by the evidence.
In practice, this is hard. We (most of us) cannot just go to Ukraine (if we’re not already there) to observe what really happens. But what if we think bigger? Imagine thousands of people with diverse opinions on the topic joining forces. They would have a lot more resources for making active observations to reconcile their differing opinions. For example, as a large group they have a better chance of interviewing important people. If they are honest players, they might agree on a small group of people to travel and actively make observations together. It is also easier for a large group to gather answers to unsettled questions and give them prominence. The precondition is an honest will to engage with the other side and truthfully settle the disputes.
Unfortunately, this is just a theoretical idea that I haven’t been able to test in practice yet, and it seems hard to imagine founding such an organization in a state where one can be punished for critical inquiry.